### Install string2string and scikit-learn Source: https://string2string.readthedocs.io/en/latest/hupd_example.html Install the necessary libraries using pip. This is a prerequisite for running the tutorial. ```bash %%capture !pip install string2string !pip install scikit-learn ``` -------------------------------- ### Install string2string Library Source: https://string2string.readthedocs.io/en/latest/index.html Install the string2string library using pip. Python 3.7+ is recommended. ```bash pip install string2string ``` -------------------------------- ### Install string2string and other libraries Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html Installs the string2string library along with scikit-learn and networkx for data processing and analysis. ```bash %%capture !pip install string2string !pip install scikit-learn !pip install networkx ``` -------------------------------- ### Initialize FaissSearch with OPT-125M Model Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html Instantiate the FaissSearch tool using a pre-trained model from Hugging Face. Ensure the 'transformers' library is installed. ```python # Let's download OPT-125M from Facebook using HuggingFace's transformers library model_name = 'facebook/opt-125m' faiss_search = FaissSearch(model_name_or_path = model_name) ``` -------------------------------- ### Initialize FaissSearch Wrapper Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html Initializes the FaissSearch wrapper with a specified model, tokenizer, and device. Ensure Faiss is installed and cite the relevant paper and GitHub repository if using this class. 
```python from string2string.search.faiss_search import FaissSearch # Initialize with default model and device searcher = FaissSearch() # Initialize with a specific model and device # searcher = FaissSearch(model_name_or_path='sentence-transformers/all-MiniLM-L6-v2', device='cuda') ``` -------------------------------- ### Longest Common Substring Computation Setup Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html Sets up the computation for the longest common substring by determining if inputs are lists and initializing the distance matrix. ```python boolList = False if isinstance(str1, list) and isinstance(str2, list): boolList = True # Lengths of strings str1 and str2, respectively. n = len(str1) m = len(str2) # Initialize the distance matrix. dist = np.zeros((n + 1, m + 1), dtype=int) ``` -------------------------------- ### Initialize and Compute sacreBLEU Score Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/metrics/sbleu.html Demonstrates how to initialize the sacreBLEU class and compute the BLEU score for a given set of predictions and references. It shows an example of calling the compute method with default parameters. ```python from typing import Union, Optional, List, Dict from string2string.misc.default_tokenizer import Tokenizer from sacrebleu import corpus_bleu ALLOWED_TOKENIZERS = { 'none': 'tokenizer_none.NoneTokenizer', 'zh': 'tokenizer_zh.TokenizerZh', '13a': 'tokenizer_13a.Tokenizer13a', 'intl': 'tokenizer_intl.TokenizerV14International', 'char': 'tokenizer_char.TokenizerChar', 'ja-mecab': 'tokenizer_ja_mecab.TokenizerJaMecab', 'ko-mecab': 'tokenizer_ko_mecab.TokenizerKoMecab', 'spm': 'tokenizer_spm.TokenizerSPM', 'flores101': 'tokenizer_spm.Flores101Tokenizer', 'flores200': 'tokenizer_spm.Flores200Tokenizer', } class sacreBLEU: """ This class contains the sacreBLEU metric. """ def __init__(self) -> None: """ Initializes the BLEU class. 
        """
        pass

    def compute(self,
        predictions: List[str],
        references: List[List[str]],
        smooth_method: str = 'exp',
        smooth_value: Optional[float] = None,
        lowercase: bool = False,
        tokenizer_name: Optional[str] = 'none',
        use_effective_order: bool = False,
        return_only: List[str] = ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
        ):
        """
        Returns the BLEU score between a list of predictions and list of list of references.

        Arguments:
            predictions (List[str]): The predictions.
            references (List[List[str]]): The references (or ground truth strings).
            smooth_method (str): The smoothing method. Default is "exp". Other options are "floor", "add-k" and "none".
            smooth_value (Optional[float]): The smoothing value for floor and add-k smoothing. Default is None.
            lowercase (bool): Whether to lowercase the text. Default is False.
            tokenizer_name (str): The tokenizer name. Default is "none". Other options are "zh", "13a", "intl", "char", "ja-mecab", "ko-mecab", "spm", "flores101" and "flores200".
            use_effective_order (bool): Whether to use the effective order. Default is False.
            return_only (Optional[List[str]]): The list of BLEU score components to return. Default is ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len'].

        Returns:
            Dict[str, float]: The requested BLEU score components (sacreBLEU reports 'score' on a 0-100 scale).

        Raises:
            ValueError: If the number of predictions does not match the number of references.
            ValueError: If the tokenizer name is invalid.
""" # Check that the number of predictions matches the number of references if len(predictions) != len(references): raise ValueError('The number of predictions does not match the number of references.') # Check that the tokenizer name is valid if tokenizer_name not in ALLOWED_TOKENIZERS: raise ValueError('The tokenizer name is invalid.') # Check that the size of each reference list is the same reference_size = len(references[0]) for reference in references: if len(reference) != reference_size: raise ValueError('The size of each reference list is not the same.') # Transform the references into a list of list of references. # This is necessary because sacrebleu.corpus_bleu expects a list of list of references. transformed_references = [[refs[i] for refs in references] for i in range(reference_size)] # Compute the BLEU score using sacrebleu.corpus_bleu # This function returns "BLEUScore(score, correct, total, precisions, bp, sys_len, ref_len)" bleu_score = corpus_bleu( hypotheses=predictions, references=transformed_references, smooth_method=smooth_method, smooth_value=smooth_value, lowercase=lowercase, use_effective_order=use_effective_order, **(dict(tokenize=ALLOWED_TOKENIZERS[tokenizer_name]) if tokenizer_name != 'none' else {}), ) # Get a summary of all the relevant BLEU score components final_scores = {k: getattr(bleu_score, k) for k in return_only} # Return the BLEU score return final_scores # predictions = ["hello there general kenobi", "foo bar foobar"] # references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]] # sbleu = sacreBLEU() # bleu_score = sbleu.compute(predictions, references) # print(bleu_score) ``` -------------------------------- ### search() Method Source: https://string2string.readthedocs.io/en/latest/matching.html Searches for a given pattern within a text using the KMP algorithm. Returns the starting index of the first occurrence of the pattern. 
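The KMP idea named above can be sketched in plain Python. This is an independent illustration of the technique, not the library's implementation: `build_lps` and `kmp_search` are hypothetical names, and returning 0 for an empty pattern is an assumption of this sketch.

```python
from typing import List


def build_lps(pattern: str) -> List[int]:
    """Return, for each prefix of pattern, the length of its longest proper prefix that is also a suffix."""
    lps = [0] * len(pattern)
    length = 0  # length of the currently matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            # Fall back to the next shorter candidate prefix without advancing i.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0  # assumption of this sketch, mirroring str.find
    lps = build_lps(pattern)
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse the lps table instead of restarting from scratch.
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1
```

Because the text index never moves backwards, the search runs in time linear in the combined length of the pattern and the text.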
```APIDOC
## search(pattern: str, text: str) -> int

### Description
This function searches for the pattern in the text using the KMP algorithm.

### Parameters
* **pattern** (_str_) – The pattern to search for.
* **text** (_str_) – The text to search in.

### Returns
The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
* **AssertionError** – If the text is not a string.
```

--------------------------------

### Naive Search Algorithm Implementation

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html

Implements the naive string search algorithm. It iterates through the text and checks for a match at each possible starting position. Use this for simple cases or when performance is not critical.

```python
class NaiveSearch(SearchAlgorithm):
    """
    This class contains the naive search algorithm.
    """

    def __init__(self) -> None:
        """
        Initializes the class.

        Returns:
            None
        """
        super().__init__()

    def search(self,
        pattern: str,
        text: str,
    ) -> int:
        """
        Searches for the pattern in the text.

        Arguments:
            pattern (str): The pattern to search for.
            text (str): The text to search in.

        Returns:
            int: The index of the pattern in the text (or -1 if the pattern is not found).

        Raises:
            AssertionError: If the inputs are invalid.
        """
        # Check the inputs
        assert isinstance(pattern, str), 'The pattern must be a string.'
        assert isinstance(text, str), 'The text must be a string.'

        # Set the attributes
        self.pattern = pattern
        self.pattern_length = len(self.pattern)

        # Loop over the text
        for i in range(len(text) - self.pattern_length + 1):
            # Check if the strings match
            if text[i:i + self.pattern_length] == self.pattern:
                return i

        # Return -1 if the pattern is not found
        return -1
```

--------------------------------

### Get Last Hidden State

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/model_embeddings.html

Extracts the last hidden state from the input embeddings.
This is often used to get the representation of the first token (e.g., [CLS] token) in a sequence. ```python def get_last_hidden_state(self, embeddings: torch.Tensor, ) -> torch.Tensor: """ Returns the last hidden state (e.g., [CLS] token's) of the input embeddings. Arguments: embeddings (torch.Tensor): The input embeddings. Returns: torch.Tensor: The last hidden state. """ # Get the last hidden state last_hidden_state = embeddings.last_hidden_state # Return the last hidden state return last_hidden_state[:, 0, :] ``` -------------------------------- ### Import Libraries for String2string Tutorial Source: https://string2string.readthedocs.io/en/latest/hupd_example.html Import required libraries for data processing, semantic search, dataset loading, dimensionality reduction, and visualization. ```python # For data processing import numpy as np from collections import Counter # To perform semantic search via Faiss from string2string.search import FaissSearch # To load HUPD from datasets import load_dataset # To perform dimensionality reduction from sklearn.manifold import TSNE # For visualization purposes (we will use both matplotlib and plotly) %matplotlib inline import matplotlib.pyplot as plt from string2string.misc.plotting_functions import plot_corpus_embeds_with_plotly ``` -------------------------------- ### Get Embeddings Source: https://string2string.readthedocs.io/en/latest/embedding.html Generates and returns the embeddings for a list of tokens or a single string. ```APIDOC ## __call__(tokens: List[str] | str) -> Tensor ### Description This function returns the embeddings of the given tokens. ### Parameters * **tokens** (Union[List[str], str]) - The tokens to embed. ### Returns The embeddings of the given tokens. 
### Return type
Tensor
```

--------------------------------

### BoyerMooreSearch.aux_get_suffix_prefix_length

Source: https://string2string.readthedocs.io/en/latest/matching.html

Computes the length of the longest suffix of a pattern slice (starting from index i) that matches a prefix of the entire pattern.

```APIDOC
## aux_get_suffix_prefix_length(i: int) -> int

### Description
This auxiliary function is used to compute the length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

### Parameters
* **i** (int) - Required - The index of the suffix.

### Returns
The length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

### Return Type
int
```

--------------------------------

### FaissSearch Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the FaissSearch wrapper with a specified model, tokenizer, and device. It loads the tokenizer and model, sets the device, and prepares for semantic search operations.

```APIDOC
## FaissSearch

### Description
Initializes the wrapper for the FAISS library, which is used to perform semantic search.

### Arguments
- **model_name_or_path** (str, optional): The name or path of the model to use. Defaults to 'facebook/bart-large'.
- **tokenizer_name_or_path** (str, optional): The name or path of the tokenizer to use. Defaults to 'facebook/bart-large'.
- **device** (str, optional): The device to use. Defaults to 'cpu'.
### Returns None ### Attention - If you use this class, please make sure to cite the following paper: ```latex @article{johnson2019billion, title={Billion-scale similarity search with {GPUs}}, author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}}, journal={IEEE Transactions on Big Data}, volume={7}, number={3}, pages={535--547}, year={2019}, publisher={IEEE} } ``` - The code is based on the following GitHub repository: https://github.com/facebookresearch/faiss ``` -------------------------------- ### NaiveSearch Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html Implements the naive string searching algorithm. It iterates through all possible starting positions in the text and checks for a match with the pattern. ```APIDOC ## NaiveSearch ### Description Implements the naive string searching algorithm. It iterates through all possible starting positions in the text and checks for a match with the pattern. ### Methods #### `__init__()` Initializes the NaiveSearch class. #### `search(pattern: str, text: str) -> int` Searches for the pattern in the text. ### Parameters #### `search` Parameters - **pattern** (str) - The pattern to search for. - **text** (str) - The text to search in. ### Returns - **int**: The index of the pattern in the text (or -1 if the pattern is not found). ### Raises - **AssertionError**: If the inputs are invalid (e.g., not strings). ``` -------------------------------- ### SmithWaterman Initialization Source: https://string2string.readthedocs.io/en/latest/alignment.html Initializes the Smith-Waterman algorithm with specified weights and gap character. It can also use a custom match dictionary. ```APIDOC ## SmithWaterman Constructor ### Description Initializes the class variables of the Smith-Waterman algorithm, used for local alignment of sequences (e.g., strings or lists of strings) such as DNA sequences. 
### Parameters * **match_weight** (int or float) - The weight of a match (default: 1). * **mismatch_weight** (int or float) - The weight of a mismatch (default: -1). * **gap_weight** (int or float) - The weight of a gap (default: -1). * **gap_char** (str) - The character used to represent a gap (default: '-') * **match_dict** (dict or None) - The dictionary that maps the characters to their match weights (default: None). ``` -------------------------------- ### Get Alignment with Hirschberg Algorithm Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html Computes and returns the global alignment of two strings or lists of strings using the Hirschberg algorithm. ```python def get_alignment(self, str1: Union[str, List[str]], str2: Union[str, List[str]], ) -> Tuple[Union[str, List[str]], Union[str, List[str]]]: r""" This function gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm. Arguments: str1: The first string (or list of strings). str2: The second string (or list of strings). Returns: The aligned strings as a tuple of two strings (or list of strings). """ ``` -------------------------------- ### Initialize and Index Dataset Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html Initializes the dataset, generates embeddings, and adds a FAISS index for efficient semantic search. This method processes the input corpus, generates embeddings for a specified section, and optionally saves the processed dataset. ```APIDOC ## add_faiss_index ### Description This function adds a FAISS index to the dataset for efficient semantic search. ### Arguments - **dataset_dict** (Dict[str, List[str]]): The dataset dictionary. - **section** (str): The section of the dataset to use whose embeddings will be used for semantic search (e.g., 'text', 'title') (default: 'text'). 
- **index_column_name** (str): The name of the column containing the embeddings (default: 'embeddings'). - **embedding_type** (str): The type of embedding to use (default: 'last_hidden_state'). - **batch_size** (int, optional): The batch size to use (default: 8). - **max_length** (int, optional): The maximum length of the input sequences. - **num_workers** (int, optional): The number of workers to use. - **save_path** (Optional[str], optional): The path to save the dataset (default: None). ### Returns - Dataset: The dataset object (HuggingFace Datasets). ### Raises - ValueError: If the dataset is not a dictionary or pandas DataFrame or HuggingFace Datasets object. ``` -------------------------------- ### Get Gap Weight for Character Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html Returns the predefined gap weight for a given character or string. This is a utility function for alignment scoring. ```python def get_gap_weight(self, c: Union[str, List[str]], ) -> float: """ This function returns the gap weight of a character or string. Arguments: c (str or list of str): The character or string. Returns: The gap weight of the character or string. """ ``` -------------------------------- ### Call Data Preparation for Plotly Source: https://string2string.readthedocs.io/en/latest/hupd_example.html Calls the prepare_plotly_data function to get the necessary data structures for creating an interactive plotly visualization. 
```python
# Let's prepare the data for plotly
tsne_coords, tsne_labels, tsne_titles, tsne_hover_texts = prepare_plotly_data(
    tsne_embeddings, patent_titles, patent_ipc_subclass_labels, most_common_labels)
```

--------------------------------

### Initialize Dataset and Add FAISS Index

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the dataset from various formats (dict, DataFrame, Dataset) and adds a FAISS index for efficient semantic search. It maps the specified section to embeddings and optionally saves the processed dataset.

```python
# Excerpt from a method body: `self`, `corpus`, `section`, and the other names
# are the enclosing method's own arguments and attributes.

# Set the dataset
if isinstance(corpus, dict):
    self.dataset = Dataset.from_dict(corpus)
elif isinstance(corpus, pd.DataFrame):
    self.dataset = Dataset.from_pandas(corpus)
elif isinstance(corpus, Dataset):
    self.dataset = corpus
else:
    raise ValueError('The dataset must be a dictionary or pandas DataFrame.')

# Set the embedding_type
self.embedding_type = embedding_type

# Tokenize the dataset
# self.dataset = self.dataset.map(
#     lambda x: x[section],
#     batched=True,
#     batch_size=batch_size,
#     num_proc=num_workers,
# )

# Map the section of the dataset to the embeddings
self.dataset = self.dataset.map(
    lambda x: {
        index_column_name: self.get_embeddings(x[section], embedding_type=self.embedding_type).detach().cpu().numpy()[0]
    },
    # batched=True,
    batch_size=batch_size,
    num_proc=num_workers,
)

# Save the dataset
if save_path is not None:
    self.dataset.to_json(save_path)

# Add FAISS index
self.add_faiss_index(
    column_name=index_column_name,
)

# Return the dataset
return self.dataset
```

--------------------------------

### BoyerMooreSearch Class Initialization

Source: https://string2string.readthedocs.io/en/latest/matching.html

Initializes the Boyer-Moore search algorithm class. This class implements the Boyer-Moore string searching algorithm, known for its efficiency in skipping large sections of text.

```APIDOC
## __init__()

### Description
Initializes the Boyer-Moore search algorithm class.
The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms.

### Parameters
None

### Returns
None
```

--------------------------------

### Get Mean Pooling Embedding

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Computes the mean pooling of the input embeddings. This provides a sentence or document-level embedding by averaging all token embeddings.

```python
mean_pooling = embeddings.last_hidden_state.mean(dim=1)
return mean_pooling
```

--------------------------------

### Import Plotly and NetworkX Libraries

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

Import the necessary libraries for plotting network graphs. This is a prerequisite for the subsequent steps.

```python
# Let's import the specific modules that we will use for plotting a network graph with plotly
import plotly.graph_objects as go
import networkx as nx
```

--------------------------------

### Get Mean Pooling

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/model_embeddings.html

Calculates the mean pooling of the input embeddings. This aggregates the embeddings of all tokens in a sequence into a single vector by taking the average.

```python
def get_mean_pooling(self,
    embeddings: torch.Tensor,
) -> torch.Tensor:
    """
    Returns the mean pooling of the input embeddings.

    Arguments:
        embeddings (torch.Tensor): The input embeddings.

    Returns:
        torch.Tensor: The mean pooling.
    """
    # Get the mean pooling
    mean_pooling = embeddings.last_hidden_state.mean(dim=1)

    # Return the mean pooling
    return mean_pooling
```

--------------------------------

### initialize_lps() Method

Source: https://string2string.readthedocs.io/en/latest/matching.html

Initializes the longest proper prefix suffix (lps) array.
This array is crucial for the KMP algorithm's efficiency by helping to avoid redundant comparisons. ```APIDOC ## initialize_lps() ### Description Initializes the longest proper prefix suffix (lps) array, which contains the length of the longest proper prefix that is also a suffix of the pattern. ### Parameters * **pattern** (_str_) – The pattern to search for. ### Returns None ``` -------------------------------- ### Reset Polynomial Rolling Hash Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/hash_functions.html Resets the current hash value to zero. This is useful when starting a new computation or re-initializing the hash state. ```python def reset(self) -> None: """ Resets the hash value. Arguments: None Returns: None """ # Reset the current hash value self.current_hash = 0 ``` -------------------------------- ### initialize_corpus Source: https://string2string.readthedocs.io/en/latest/matching.html Initializes a dataset for semantic search from various input formats like dictionaries, pandas DataFrames, or HuggingFace Datasets. ```APIDOC ## initialize_corpus ### Description Initializes a dataset using a dictionary or pandas DataFrame or HuggingFace Datasets object. ### Parameters * **dataset_dict** (Dict[str, List[str]]) - The dataset dictionary. * **section** (str) - The section of the dataset to use whose embeddings will be used for semantic search (e.g., ‘text’, ‘title’, etc.) (default: ‘text’). * **index_column_name** (str) - The name of the column containing the embeddings (default: ‘embeddings’) * **embedding_type** (str) - The type of embedding to use (default: ‘last_hidden_state’). * **batch_size** (int, optional) - The batch size to use (default: 8). * **max_length** (int, optional) - The maximum length of the input sequences. * **num_workers** (int, optional) - The number of workers to use. * **save_path** (Optional[str], optional) - The path to save the dataset (default: None). 
### Returns
The dataset object (HuggingFace Datasets).

### Return type
Dataset

### Raises
* **ValueError** - If the dataset is not a dictionary or pandas DataFrame or HuggingFace Datasets object.
```

--------------------------------

### compute_multi_ref_score

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/similarity/bartscore.html

Scores a batch of examples where each source sentence can have multiple target references. It supports different aggregation methods like 'mean' or 'max'.

```APIDOC
## compute_multi_ref_score

### Description
Scores a batch of examples with multiple references. This method is used when each source sentence is associated with a list of target sentences (references). It supports aggregation of scores from multiple references.

### Parameters
- **source_sentences** (List[str]): The source sentences.
- **target_sentences** (List[List[str]]): The target sentences, where each element is a list of reference sentences for the corresponding source sentence.
- **batch_size** (int): The batch size to use for processing. Defaults to 4.
- **agg** (str): The aggregation method. Can be 'mean' or 'max'. Defaults to 'mean'.

### Returns
- **Dict[str, List[float]]**: A dictionary containing the aggregated BARTScore for each example.

### Raises
- **ValueError**: If the number of source sentences and target sentences do not match.
- **Exception**: If the number of references per sample is inconsistent.
```

--------------------------------

### Smith-Waterman Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Initializes the Smith-Waterman algorithm class for local sequence alignment. Configurable with match, mismatch, and gap weights, and supports a custom match dictionary.
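As a rough illustration of how match, mismatch, and gap weights (and an optional match dictionary) can feed a local-alignment score matrix, here is a minimal plain-Python sketch. It is not the library's code: `pair_weight` and `local_score_matrix` are hypothetical names, and the dictionary fallback simply mirrors the behaviour described for `get_match_weight` elsewhere in this document.

```python
from typing import Dict, List, Optional, Sequence

MatchDict = Dict[str, Dict[str, float]]


def pair_weight(c1: str, c2: str,
                match_weight: float = 1, mismatch_weight: float = -1,
                match_dict: Optional[MatchDict] = None) -> float:
    """Score one pair of symbols, preferring an explicit match_dict entry."""
    if match_dict is not None and c1 in match_dict and c2 in match_dict[c1]:
        return match_dict[c1][c2]
    # Fallback for pairs that are not listed in the dictionary.
    return match_weight if c1 == c2 else mismatch_weight


def local_score_matrix(str1: Sequence[str], str2: Sequence[str],
                       match_weight: float = 1, mismatch_weight: float = -1,
                       gap_weight: float = -1,
                       match_dict: Optional[MatchDict] = None) -> List[List[float]]:
    """Fill a Smith-Waterman style score matrix; no cell drops below zero."""
    n, m = len(str1), len(str2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + pair_weight(
                str1[i - 1], str2[j - 1], match_weight, mismatch_weight, match_dict)
            up = score[i - 1][j] + gap_weight
            left = score[i][j - 1] + gap_weight
            # Clamping at zero is what makes the alignment local rather than global.
            score[i][j] = max(0, diag, up, left)
    return score
```

The largest value in the returned matrix is the score of the best local alignment; its position is where a traceback would start.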
```python
from typing import Union


def __init__(self,
    match_weight: Union[int, float] = 1,
    mismatch_weight: Union[int, float] = -1,
    gap_weight: Union[int, float] = -1,
    gap_char: str = '-',
    match_dict: dict = None,
) -> None:
    r"""
    This function initializes the class variables of the Smith-Waterman algorithm, used for local alignment of sequences (e.g., strings or lists of strings) such as DNA sequences.
    """
```

--------------------------------

### BoyerMooreSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a given pattern within a text using the Boyer-Moore algorithm. Returns the starting index of the first occurrence of the pattern or -1 if not found.

```APIDOC
## search(pattern: str, text: str) -> int

### Description
This function searches for the pattern in the text using the Boyer-Moore algorithm.

### Parameters
* **pattern** (str) - Required - The pattern to search for.
* **text** (str) - Required - The text to search in.

### Returns
The index of the pattern in the text (or -1 if the pattern is not found).

### Return Type
int

### Raises
**AssertionError** – If the text or the pattern is not a string.
```

--------------------------------

### Boyer-Moore Search Algorithm Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html

Initializes the Boyer-Moore search algorithm class. This class is designed for efficient string searching using heuristics to skip sections of text.

```python
def __init__(self) -> None:
    """
    This function initializes the Boyer-Moore search algorithm class. [BM1977]_

    The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms such as brute-force or Knuth-Morris-Pratt. It is particularly useful for searching for patterns in large amounts of text.

    .. [BM1977] Boyer, RS and Moore, JS.
        "A fast string searching algorithm." Communications of the ACM 20.10 (1977): 762-772.

    A Correct Preprocessing Algorithm for Boyer–Moore String-Searching

    https://www.cs.jhu.edu/~langmea/resources/lecture_notes/strings_matching_boyer_moore.pdf
    """
    super().__init__()
```

--------------------------------

### Initialize Tokenizer

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/default_tokenizer.html

Initializes the Tokenizer with a custom word delimiter. Defaults to a space if not specified.

```python
from string2string.misc.default_tokenizer import Tokenizer

tokenizer = Tokenizer(word_delimiter=" ")
```

--------------------------------

### RabinKarpSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a pattern within a given text using the Rabin-Karp algorithm. Returns the starting index of the first occurrence of the pattern, or -1 if not found.

```APIDOC
## RabinKarpSearch.search

### Description
Searches for the pattern in the text using the Rabin-Karp algorithm.

### Returns
- **return value** (int) - The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
- **AssertionError** – If the inputs are invalid.
```

--------------------------------

### StringAlignment Class Initialization

Source: https://string2string.readthedocs.io/en/latest/alignment.html

Initializes the StringAlignment class with customizable weights and gap characters.

```APIDOC
## __init__

### Description
This function initializes the StringAlignment class.

### Parameters
* **match_weight** (int) - The weight for a match (default: 1).
* **mismatch_weight** (int) - The weight for a mismatch (default: -1).
* **gap_weight** (int) - The weight for a gap (default: -1).
* **gap_char** (str) - The character for a gap (default: "-").
* **match_dict** (dict | None) - The match dictionary (default: None).

### Returns
None

### Note
The match_dict represents a dictionary of the match weights for each pair of characters. For example, if the match_dict is {"A": {"A": 1, "T": -1}, "T": {"A": -1, "T": 1}}, then the match weight for "A" and "A" is 1, the match weight for "A" and "T" is -1, the match weight for "T" and "A" is -1, and the match weight for "T" and "T" is 1.

The match_dict is particularly useful when we wish to align (or match) non-identical characters. For example, if we wish to align "A" and "T", we can set the match_dict to {"A": {"T": 1}}. This ensures that the match weight for "A" and "T" is 1. Pairs that are absent from the dictionary, such as "A" and "A" or "T" and "A", fall back to match_weight when the two characters are equal and to mismatch_weight otherwise (see get_match_weight).
```

--------------------------------

### Get Match Weight for Characters

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Retrieves the scoring weight for a pair of characters. Uses the match/mismatch weights if no custom dictionary is defined, otherwise consults the dictionary.

```python
def get_match_weight(self,
    c1: Union[str, List[str]],
    c2: Union[str, List[str]],
) -> float:
    """
    This function returns the match weight of two characters.

    Arguments:
        c1 (str or list of str): The first character or string.
        c2 (str or list of str): The second character or string.

    Returns:
        The match weight of the two characters or strings.
    """
    # If there is no match dictionary, return the match weight if the characters are the same, and the mismatch weight otherwise.
    if self.match_dict is None:
        if c1 == c2:
            return self.match_weight
        return self.mismatch_weight
    # Otherwise, return the match weight according to the match dictionary.
    else:
        if c1 in self.match_dict and c2 in self.match_dict[c1]:
            return self.match_dict[c1][c2]
        else:
            if c1 == c2:
                return self.match_weight
            return self.mismatch_weight
```

--------------------------------

### get_alignment

Source: https://string2string.readthedocs.io/en/latest/alignment.html

Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm. This method combines divide-and-conquer and dynamic programming principles for a space-efficient solution.

```APIDOC
## get_alignment

### Description
Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.

### Parameters
* **str1** (str or List[str]) - The first string (or list of strings).
* **str2** (str or List[str]) - The second string (or list of strings).

### Returns
* Tuple[str or List[str], str or List[str]] - The aligned strings as a tuple of two strings (or list of strings).
```

--------------------------------

### Initialize Faiss Semantic Search Tool

Source: https://string2string.readthedocs.io/en/latest/hupd_example.html

Initializes the FaissSearch tool with the HUPD Distil-RoBERTa model. This model is suitable for semantic search tasks requiring robust language understanding.

```python
# Let's download the HUPD DistilRoBERTa model
model_name = 'HUPD/hupd-distilroberta-base'
faiss_search = FaissSearch(model_name_or_path = model_name)
```

--------------------------------

### Smith-Waterman Backtracking Function

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Overrides the Needleman-Wunsch backtrack function to perform local alignment. It starts from the highest score in the matrix and traces back until a score of zero is encountered.
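The traceback described here (start at the highest-scoring cell, stop at the first zero) can be illustrated with a small self-contained sketch. It is a simplified stand-in rather than the library's method: `smith_waterman` is a hypothetical name, it handles plain strings only, and it uses fixed match, mismatch, and gap weights instead of a match dictionary.

```python
from typing import Tuple


def smith_waterman(s1: str, s2: str, match: int = 1, mismatch: int = -1,
                   gap: int = -1, gap_char: str = "-") -> Tuple[str, str]:
    """Return one optimal local alignment of s1 and s2 as two aligned substrings."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix, remembering where the maximum occurs.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the maximum until a zero cell is reached.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and score[i][j] != 0:
        diag = score[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
        if score[i][j] == diag:
            out1.append(s1[i - 1])
            out2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append(s1[i - 1])
            out2.append(gap_char)
            i -= 1
        else:
            out1.append(gap_char)
            out2.append(s2[j - 1])
            j -= 1

    # Symbols were collected back to front, so reverse them.
    return "".join(reversed(out1)), "".join(reversed(out2))
```

Unlike a global (Needleman-Wunsch style) traceback, which starts at the bottom-right corner, this one may begin and end anywhere in the matrix, so only the best-matching region of each string is returned.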
```python
def backtrack(self,
        score_matrix: np.ndarray,
        str1: Union[str, List[str]],
        str2: Union[str, List[str]],
    ) -> Tuple[Union[str, List[str]], Union[str, List[str]]]:
    """
    This function overrides the backtrack function of the NeedlemanWunsch class to get an optimal local alignment between two strings (or list of strings).

    Arguments:
        score_matrix (numpy.ndarray): The score matrix.
        str1 (str or list of str): The first string (or list of strings).
        str2 (str or list of str): The second string (or list of strings).

    Returns:
        The aligned substrings as a tuple of two strings (or list of strings).

    .. note::
        * The backtrack function used in this function is different from the backtrack function used in the Needleman-Wunsch algorithm. Here we start from the position with the highest score in the score matrix and trace back to the first position that has a score of zero. This is because the highest-scoring subsequence may not necessarily span the entire length of the sequences being aligned.
        * On the other hand, the backtrack function used in the Needleman-Wunsch algorithm traces back through the entire score matrix, starting from the bottom-right corner, to determine the optimal alignment path. This is because the algorithm seeks to find the global alignment of two sequences, which means aligning them from the beginning to the end.
    """
    # Initialize the aligned substrings.
    aligned_str1 = ""
    aligned_str2 = ""

    # Get the position with the maximum score in the score matrix.
    # TODO(msuzgun): See if there is a faster way to get the position with the maximum score in the score matrix.
    i, j = np.unravel_index(np.argmax(score_matrix, axis=None), score_matrix.shape)

    # Backtrack the score matrix.
    while score_matrix[i, j] != 0:
        # Get the scores of the three possible paths.
        match_score = score_matrix[i - 1, j - 1] + self.get_match_weight(str1[i - 1], str2[j - 1])
        delete_score = score_matrix[i - 1, j] + self.get_gap_weight(str1[i - 1])
        insert_score = score_matrix[i, j - 1] + self.get_gap_weight(str2[j - 1])

        # Get the maximum score.
        max_score = max(match_score, delete_score, insert_score)

        # Backtrack the score matrix.
        if max_score == match_score:
            insert_str1, insert_str2 = self.add_space_to_shorter(str1[i - 1], str2[j - 1])
            i -= 1
            j -= 1
        elif max_score == delete_score:
            insert_str1, insert_str2 = self.add_space_to_shorter(str1[i - 1], self.gap_char)
            i -= 1
        elif max_score == insert_score:
            insert_str1, insert_str2 = self.add_space_to_shorter(self.gap_char, str2[j - 1])
            j -= 1
```

--------------------------------

### FastTextEmbeddings Initialization

Source: https://string2string.readthedocs.io/en/latest/embedding.html

Initializes the FastTextEmbeddings class with a specified model, download behavior, and directory.

```APIDOC
## FastTextEmbeddings(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None)

### Description
This function initializes the FastTextEmbeddings class.

### Parameters
* **model** (str) - The model to use. Available models include 'cc.en.300.bin', 'wiki.en', etc.
* **force_download** (bool) - Whether to force the download of the model. Default: True (as given in the signature above).
* **dir** (str | None) - The directory to save and load the model.

### Raises
* **ValueError** - If the given model is not available.
```

--------------------------------

### Performing a Semantic Search Query

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

This snippet defines a query string for semantic search and then runs the search by calling the `search` method of the `faiss_search` object (a FaissSearch instance). It retrieves the top 15 results, ranked by semantic similarity to the query.
```python
# Here is the query we want to search for
query = r"""Title: The Depiction of Love and Marriage in Jane Austen and William Shakespeare's Works

Both Jane Austen and William Shakespeare present complex and nuanced depictions of love and marriage in their works, highlighting the social conventions and power dynamics that shape courtship and marital relationships. However, while Austen's novels emphasize the importance of compatibility, mutual respect, and shared values in successful marriages, Shakespeare's plays often explore the tragic consequences of love that is driven by passion and societal pressure rather than genuine affection.

Jane Austen's novels, such as Pride and Prejudice and Sense and Sensibility, are renowned for their sharp critique of the social norms and gender roles that govern courtship and marriage in the Regency era. Austen challenges the prevailing notion that marriage is primarily a means of securing financial stability and social status, portraying characters who seek genuine emotional and intellectual connections with their partners. For instance, in Pride and Prejudice, Elizabeth Bennet resists her mother's pressure to marry a wealthy and titled man and instead falls in love with Mr. Darcy, a man whom she initially dislikes due to his pride and aloofness. Their relationship evolves through a series of misunderstandings and self-reflection, leading to a mutual recognition of their faults and virtues. Similarly, in Sense and Sensibility, the Dashwood sisters navigate the challenges of romantic attachment and social expectations, ultimately finding happiness with men who share their values and interests.

In contrast, William Shakespeare's plays, such as Romeo and Juliet and Othello, often depict love and marriage as tragic and fraught with conflict. Shakespeare portrays characters who are driven by intense emotions and societal pressures, leading them to make rash decisions that result in ruin and despair.
For example, in Romeo and Juliet, the young lovers defy their feuding families and elope, but their passion is ultimately their downfall, as their families' enmity leads to a series of tragic events that culminate in their deaths. Similarly, in Othello, the titular character's jealousy and mistrust of his wife, Desdemona, ultimately leads to her murder, highlighting the destructive power of toxic masculinity and insecurity. Furthermore, while Austen's novels emphasize the importance of compatibility and shared values in successful marriages, Shakespeare's plays often depict marriages that are fraught with power imbalances and emotional distance. For instance, in The Taming of the Shrew, Petruchio marries Katherine, a headstrong and independent woman, and seeks to "tame" her into submission through verbal and physical abuse. Although the play is often interpreted as a satire of gender norms, it nevertheless reinforces the patriarchal notion that women must be subordinated to men in marriage. Similarly, in Macbeth, the titular character's ambition and thirst for power lead him to murder his king and estrange himself from his wife, Lady Macbeth, who ultimately succumbs to guilt and madness. In conclusion, Jane Austen and William Shakespeare offer contrasting depictions of love and marriage in their works, reflecting the social and cultural norms of their respective eras. Austen's novels emphasize the importance of emotional and intellectual compatibility in successful marriages, challenging the notion that marriage is primarily a financial transaction. In contrast, Shakespeare's plays often explore the tragic consequences of love that is driven by passion and societal pressure rather than genuine affection. While both Austen and Shakespeare offer complex and nuanced depictions of love and marriage, their works ultimately serve as a commentary on the enduring human desire for connection and companionship. 
Word Count: 797
"""

# Let's get the top 15 results
results = faiss_search.search(query, k=15)
```

--------------------------------

### NaiveSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a pattern within a given text using the naive (brute-force) approach. Returns the starting index of the first occurrence of the pattern, or -1 if the pattern is not found.

```APIDOC
## NaiveSearch.search

### Description
Searches for the pattern in the text using a brute-force method.

### Method
Not applicable (Python method)

### Endpoint
Not applicable (Python method)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response
- **return value** (int) - The index of the pattern in the text (or -1 if the pattern is not found).

#### Response Example
None

### Raises
- **AssertionError** - If the inputs are invalid.
```

--------------------------------

### Initialize StringAlignment Class

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Initializes the StringAlignment class with configurable weights for matches, mismatches, and gaps. Supports a custom match dictionary for scoring specific character pairs.

```python
class StringAlignment:
    def __init__(self,
        match_weight: int = 1.,
        mismatch_weight: int = -1.,
        gap_weight: int = -1,
        gap_char: str = "-",
        match_dict: dict = None,
    ) -> None:
        r"""
        This function initializes the StringAlignment class.

        Arguments:
            match_weight (int): The weight for a match (default: 1).
            mismatch_weight (int): The weight for a mismatch (default: -1).
            gap_weight (int): The weight for a gap (default: -1).
            gap_char (str): The character for a gap (default: "-").
            match_dict (dict): The match dictionary (default: None).

        Returns:
            None

        .. note::
            The match_dict represents a dictionary of the match weights for each pair of characters.
            For example, if the match_dict is {"A": {"A": 1, "T": -1}, "T": {"A": -1, "T": 1}}, then the match weight for "A" and "A" is 1, the match weight for "A" and "T" is -1, the match weight for "T" and "A" is -1, and the match weight for "T" and "T" is 1.
            The match_dict is particularly useful when we wish to align (or match) non-identical characters. For example, if we wish to align "A" and "T", we can set the match_dict to {"A": {"T": 1}}. This will ensure that the match weight for "A" and "T" is 1. Pairs that are missing from the dictionary, such as "A" and "A" or "T" and "T", fall back to match_weight when the characters are equal and to mismatch_weight otherwise (see get_match_weight).
        """
        # Set the weights.
        self.match_weight = match_weight
        self.mismatch_weight = mismatch_weight
        self.gap_weight = gap_weight
        self.gap_char = gap_char
        self.match_dict = match_dict
```

--------------------------------

### Initialize ROUGE Wrapper

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/metrics/rouge.html

Initializes the ROUGE class, optionally with a custom tokenizer. If no tokenizer is provided, a default one with a space delimiter is used.

```python
from string2string.metrics.rouge import ROUGE
from string2string.misc.default_tokenizer import Tokenizer

# Initialize with default tokenizer
rouge_wrapper = ROUGE()

# Initialize with a custom tokenizer
custom_tokenizer = Tokenizer(word_delimiter='\t')
rouge_wrapper_custom = ROUGE(tokenizer=custom_tokenizer)
```

--------------------------------

### Get Character Pair Score

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Returns the score for a pair of characters based on the match, mismatch, or gap weights. The checks run in order: identical characters get the match weight, a pair containing the gap character gets the gap weight, and any other pair gets the mismatch weight.

```python
def get_score(self,
        c1: Union[str, List[str]],
        c2: Union[str, List[str]],
    ) -> float:
    """
    This function returns the score of a character or string pair.

    Arguments:
        c1 (str or list of str): The first character or string.
        c2 (str or list of str): The second character or string.

    Returns:
        The score of the character or string pair.
""" # If the characters are the same, return the match weight. if c1 == c2: return self.match_weight # If one of the characters is a gap, return the gap weight. elif c1 == self.gap_char or c2 == self.gap_char: return self.gap_weight # Otherwise, return the mismatch weight. else: return self.mismatch_weight ``` -------------------------------- ### Get Last Hidden State Embedding Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html Extracts the last hidden state, typically the [CLS] token's representation, from input embeddings. This method is useful for sentence-level embeddings. ```python last_hidden_state = embeddings.last_hidden_state return last_hidden_state[:, 0, :] ```