### Install string2string and scikit-learn

Source: https://string2string.readthedocs.io/en/latest/hupd_example.html

Install the necessary libraries using pip. This is a prerequisite for running the tutorial.

```bash
%%capture
!pip install string2string
!pip install scikit-learn

```

--------------------------------

### Install string2string Library

Source: https://string2string.readthedocs.io/en/latest/index.html

Install the string2string library using pip. Python 3.7+ is recommended.

```bash
pip install string2string

```

--------------------------------

### Install string2string and other libraries

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

Installs the string2string library along with scikit-learn and networkx for data processing and analysis.

```bash
%%capture
!pip install string2string
!pip install scikit-learn
!pip install networkx

```

--------------------------------

### Initialize FaissSearch with OPT-125M Model

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

Instantiate the FaissSearch tool using a pre-trained model from Hugging Face. Ensure the 'transformers' library is installed.

```python
from string2string.search import FaissSearch

# Let's download OPT-125M from Facebook using HuggingFace's transformers library
model_name = 'facebook/opt-125m'
faiss_search = FaissSearch(model_name_or_path=model_name)
```

--------------------------------

### Initialize FaissSearch Wrapper

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the FaissSearch wrapper with a specified model, tokenizer, and device. Ensure Faiss is installed and cite the relevant paper and GitHub repository if using this class.

```python
from string2string.search.faiss_search import FaissSearch

# Initialize with default model and device
searcher = FaissSearch()

# Initialize with a specific model and device
# searcher = FaissSearch(model_name_or_path='sentence-transformers/all-MiniLM-L6-v2', device='cuda')

```

--------------------------------

### Longest Common Substring Computation Setup

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Sets up the computation for the longest common substring by determining if inputs are lists and initializing the distance matrix.

```python
import numpy as np

# True when both inputs are lists of tokens rather than plain strings.
boolList = False
if isinstance(str1, list) and isinstance(str2, list):
    boolList = True

# Lengths of strings str1 and str2, respectively.
n = len(str1)
m = len(str2)

# Initialize the distance matrix.
dist = np.zeros((n + 1, m + 1), dtype=int)
```
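
The setup above stops before the matrix is filled in. As a minimal sketch of one common way to finish the computation (not the library's own code), the function below uses a standard dynamic program in which each cell holds the length of the common run ending at that pair of positions. The name `longest_common_substring` and its return shape are assumptions made for this illustration, and plain lists replace the NumPy matrix so the example has no dependencies.

```python
from typing import List, Sequence, Tuple, Union


def longest_common_substring(
    str1: Union[str, List[str]],
    str2: Union[str, List[str]],
) -> Tuple[int, Sequence]:
    """Return (length, one longest common contiguous run) of str1 and str2."""
    n, m = len(str1), len(str2)

    # dist[i][j] = length of the common run ending at str1[i-1] and str2[j-1].
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    best_len, best_end = 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                dist[i][j] = dist[i - 1][j - 1] + 1
                if dist[i][j] > best_len:
                    best_len, best_end = dist[i][j], i

    # Slicing works for both strings and lists of tokens.
    return best_len, str1[best_end - best_len:best_end]


print(longest_common_substring("xabcdy", "zabcdw"))  # (4, 'abcd')
```

Because the comparison is element-wise, the same function handles lists of tokens, which is the case the `boolList` flag distinguishes.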

--------------------------------

### Initialize and Compute sacreBLEU Score

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/metrics/sbleu.html

Demonstrates how to initialize the sacreBLEU class and compute the BLEU score for a given set of predictions and references. It shows an example of calling the compute method with default parameters.

```python
from typing import Union, Optional, List, Dict
from string2string.misc.default_tokenizer import Tokenizer
from sacrebleu import corpus_bleu

ALLOWED_TOKENIZERS = {
    'none': 'tokenizer_none.NoneTokenizer',
    'zh': 'tokenizer_zh.TokenizerZh',
    '13a': 'tokenizer_13a.Tokenizer13a',
    'intl': 'tokenizer_intl.TokenizerV14International',
    'char': 'tokenizer_char.TokenizerChar',
    'ja-mecab': 'tokenizer_ja_mecab.TokenizerJaMecab',
    'ko-mecab': 'tokenizer_ko_mecab.TokenizerKoMecab',
    'spm': 'tokenizer_spm.TokenizerSPM',
    'flores101': 'tokenizer_spm.Flores101Tokenizer',
    'flores200': 'tokenizer_spm.Flores200Tokenizer',
}


class sacreBLEU:
    """
    This class contains the sacreBLEU metric.
    """


    def __init__(self) -> None:
        """
        Initializes the BLEU class.
        """
        pass


    def compute(self,
        predictions: List[str],
        references: List[List[str]],
        smooth_method: str = 'exp',
        smooth_value: Optional[float] = None,
        lowercase: bool = False,
        tokenizer_name: Optional[str] = 'none',
        use_effective_order: bool = False,
        return_only: List[str] = ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
    ):
        """
        Returns the BLEU score between a list of predictions and list of list of references.

        Arguments:
            predictions (List[str]): The predictions.
            references (List[List[str]]): The references (or ground truth strings).
            smooth_method (str): The smoothing method. Default is "exp". Other options are "floor", "add-k" and "none".
            smooth_value (Optional[float]): The smoothing value for floor and add-k smoothing. Default is None.
            lowercase (bool): Whether to lowercase the text. Default is False.
            tokenizer_name (str): The tokenizer name. Default is "none". Other options are "zh", "13a", "intl", "char", "ja-mecab", "ko-mecab", "spm", "flores101" and "flores200".
            use_effective_order (bool): Whether to use the effective order. Default is False.
            return_only (Optional[List[str]]): The list of BLEU score components to return. Default is ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len'].

        Returns:
            Dict[str, float]: The requested BLEU score components (sacreBLEU reports ``score`` on a 0 to 100 scale).

        Raises:
            ValueError: If the number of predictions does not match the number of references.
            ValueError: If the tokenizer name is invalid.
        """

        # Check that the number of predictions matches the number of references
        if len(predictions) != len(references):
            raise ValueError('The number of predictions does not match the number of references.')

        # Check that the tokenizer name is valid
        if tokenizer_name not in ALLOWED_TOKENIZERS:
            raise ValueError('The tokenizer name is invalid.')
        
        # Check that the size of each reference list is the same
        reference_size = len(references[0])
        for reference in references:
            if len(reference) != reference_size:
                raise ValueError('The size of each reference list is not the same.')
        
        # Transform the references into a list of list of references.
        # This is necessary because sacrebleu.corpus_bleu expects a list of list of references.
        transformed_references = [[refs[i] for refs in references] for i in range(reference_size)]

        # Compute the BLEU score using sacrebleu.corpus_bleu
        # This function returns "BLEUScore(score, correct, total, precisions, bp, sys_len, ref_len)"
        bleu_score = corpus_bleu(
            hypotheses=predictions,
            references=transformed_references,
            smooth_method=smooth_method,
            smooth_value=smooth_value,
            lowercase=lowercase,
            use_effective_order=use_effective_order,
            **(dict(tokenize=ALLOWED_TOKENIZERS[tokenizer_name]) if tokenizer_name != 'none' else {}),
        )

        # Get a summary of all the relevant BLEU score components
        final_scores = {k: getattr(bleu_score, k) for k in return_only}

        # Return the BLEU score
        return final_scores

    
# predictions = ["hello there general kenobi", "foo bar foobar"]
# references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]

# sbleu = sacreBLEU()
# bleu_score = sbleu.compute(predictions, references)
# print(bleu_score)

```
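
The reference transposition inside `compute` is easy to misread, so here is a small dependency-free sketch of just that step. It reuses the comprehension from the listing above; the helper name `transpose_references` is hypothetical and is not part of the library.

```python
from typing import List


def transpose_references(references: List[List[str]]) -> List[List[str]]:
    """Turn per-prediction reference lists into per-position reference streams."""
    reference_size = len(references[0])
    for reference in references:
        if len(reference) != reference_size:
            raise ValueError('The size of each reference list is not the same.')
    # Stream i holds the i-th reference of every prediction, in prediction order.
    return [[refs[i] for refs in references] for i in range(reference_size)]


references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar", "foo bar foobar"],
]
streams = transpose_references(references)
print(streams[0])  # ['hello there general kenobi', 'foo bar foobar']
print(streams[1])  # ['hello there !', 'foo bar foobar']
```

Each inner list of the result lines up with `predictions` element by element, which is the layout the comment in the listing says `corpus_bleu` expects.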

--------------------------------

### search() Method

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a given pattern within a text using the KMP algorithm. Returns the starting index of the first occurrence of the pattern.

```APIDOC
## search(_pattern : str_, _text : str_) -> int

### Description
This function searches for the pattern in the text using the KMP algorithm.

### Parameters
* **pattern** (_str_) – The pattern to search for.
* **text** (_str_) – The text to search in.

### Returns
The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
* **AssertionError** – If the text is not a string.
```
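
To make the documented behavior concrete, the following is a minimal standalone sketch of a KMP search that returns the first match index or -1. It is a textbook formulation written for this note, not the library's implementation; the function names `build_lps` and `kmp_search`, and the choice to return 0 for an empty pattern, are assumptions.

```python
from typing import List


def build_lps(pattern: str) -> List[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0  # length of the currently matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            # Fall back to the next shorter border and retry the same i.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    assert isinstance(pattern, str), 'The pattern must be a string.'
    assert isinstance(text, str), 'The text must be a string.'
    if not pattern:
        return 0  # assumption: an empty pattern matches at index 0
    lps = build_lps(pattern)
    i = j = 0  # i indexes text, j indexes pattern
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                return i - j
        elif j > 0:
            # Reuse the already matched prefix instead of restarting.
            j = lps[j - 1]
        else:
            i += 1
    return -1


print(kmp_search("abcab", "xxabcabyy"))  # 2
print(kmp_search("zz", "abc"))           # -1
```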

--------------------------------

### Naive Search Algorithm Implementation

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html

Implements the naive string search algorithm. It iterates through the text and checks for a match at each possible starting position. Use this for simple cases or when performance is not critical.

```python
class NaiveSearch(SearchAlgorithm):
    """
    This class contains the naive search algorithm.
    """


    def __init__(self) -> None:
        """
        Initializes the class.

        Returns:
            None
        """
        super().__init__()


    def search(self,
        pattern: str,
        text: str,
    ) -> int:
        """
        Searches for the pattern in the text.

        Arguments:
            pattern (str): The pattern to search for.
            text (str): The text to search in.

        Returns:
            int: The index of the pattern in the text (or -1 if the pattern is not found).

        Raises:
            AssertionError: If the inputs are invalid.
        """
        # Check the inputs
        assert isinstance(pattern, str), 'The pattern must be a string.'
        assert isinstance(text, str), 'The text must be a string.'

        # Set the attributes
        self.pattern = pattern
        self.pattern_length = len(self.pattern)

        # Loop over the text
        for i in range(len(text) - self.pattern_length + 1):
            # Check if the strings match
            if text[i:i + self.pattern_length] == self.pattern:
                return i
            
        # Return -1 if the pattern is not found
        return -1
```

--------------------------------

### Get Last Hidden State

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/model_embeddings.html

Returns the first token's vector (e.g., the [CLS] token's) from the model's last hidden state. This vector is often used as a representation of the whole sequence.

```python
def get_last_hidden_state(self,
        embeddings: torch.Tensor,
    ) -> torch.Tensor:
        """
        Returns the last hidden state (e.g., [CLS] token's) of the input embeddings.

        Arguments:
            embeddings (torch.Tensor): The input embeddings.

        Returns:
            torch.Tensor: The last hidden state.
        """

        # Get the last hidden state
        last_hidden_state = embeddings.last_hidden_state

        # Return the last hidden state
        return last_hidden_state[:, 0, :]

```

--------------------------------

### Import Libraries for String2string Tutorial

Source: https://string2string.readthedocs.io/en/latest/hupd_example.html

Import required libraries for data processing, semantic search, dataset loading, dimensionality reduction, and visualization.

```python
# For data processing
import numpy as np
from collections import Counter

# To perform semantic search via Faiss
from string2string.search import FaissSearch

# To load HUPD
from datasets import load_dataset

# To perform dimensionality reduction
from sklearn.manifold import TSNE

# For visualization purposes (we will use both matplotlib and plotly)
%matplotlib inline
import matplotlib.pyplot as plt
from string2string.misc.plotting_functions import plot_corpus_embeds_with_plotly

```

--------------------------------

### Get Embeddings

Source: https://string2string.readthedocs.io/en/latest/embedding.html

Generates and returns the embeddings for a list of tokens or a single string.

```APIDOC
## __call__(tokens: List[str] | str) -> Tensor

### Description
This function returns the embeddings of the given tokens.

### Parameters
* **tokens** (Union[List[str], str]) - The tokens to embed.

### Returns
The embeddings of the given tokens.

### Return type
Tensor
```

--------------------------------

### BoyerMooreSearch.aux_get_suffix_prefix_length

Source: https://string2string.readthedocs.io/en/latest/matching.html

Computes the length of the longest suffix of a pattern slice (starting from index i) that matches a prefix of the entire pattern.

```APIDOC
## aux_get_suffix_prefix_length(i: int) -> int

### Description
This auxiliary function is used to compute the length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

### Parameters

* **i** (int) - Required - The index of the suffix.

### Returns

The length of the longest suffix of pattern[i:] that matches a “prefix” of the pattern.

### Return Type

int
```
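
As an illustration of the quantity described above, here is a direct, unoptimized sketch that follows the wording of the description. The standalone function `suffix_prefix_length` takes the pattern as an explicit argument, which is an assumption made so the example is self-contained; the documented method reads the pattern from the instance.

```python
def suffix_prefix_length(pattern: str, i: int) -> int:
    """Length of the longest suffix of pattern[i:] that is also a prefix of pattern."""
    n = len(pattern)
    # A suffix of pattern[i:] has at most n - i characters; try the longest first.
    for length in range(n - i, 0, -1):
        if pattern[n - length:] == pattern[:length]:
            return length
    return 0


print(suffix_prefix_length("abcab", 2))  # 2, because "ab" ends "cab" and starts "abcab"
print(suffix_prefix_length("abcab", 4))  # 0, because "b" is not a prefix of "abcab"
```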

--------------------------------

### FaissSearch Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the FaissSearch wrapper with a specified model, tokenizer, and device. It loads the tokenizer and model, sets the device, and prepares for semantic search operations.

```APIDOC
## FaissSearch

### Description
Initializes the wrapper for the FAISS library, which is used to perform semantic search.

### Arguments
- **model_name_or_path** (str, optional): The name or path of the model to use. Defaults to 'facebook/bart-large'.
- **tokenizer_name_or_path** (str, optional): The name or path of the tokenizer to use. Defaults to 'facebook/bart-large'.
- **device** (str, optional): The device to use. Defaults to 'cpu'.

### Returns
None

### Attention
- If you use this class, please make sure to cite the following paper:

      @article{johnson2019billion,
          title={Billion-scale similarity search with {GPUs}},
          author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
          journal={IEEE Transactions on Big Data},
          volume={7},
          number={3},
          pages={535--547},
          year={2019},
          publisher={IEEE}
      }

- The code is based on the following GitHub repository:
  https://github.com/facebookresearch/faiss
```

--------------------------------

### NaiveSearch

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html

Implements the naive string searching algorithm. It iterates through all possible starting positions in the text and checks for a match with the pattern.

```APIDOC
## NaiveSearch

### Description
Implements the naive string searching algorithm. It iterates through all possible starting positions in the text and checks for a match with the pattern.

### Methods
#### `__init__()`
Initializes the NaiveSearch class.

#### `search(pattern: str, text: str) -> int`
Searches for the pattern in the text.

### Parameters
#### `search` Parameters
- **pattern** (str) - The pattern to search for.
- **text** (str) - The text to search in.

### Returns
- **int**: The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
- **AssertionError**: If the inputs are invalid (e.g., not strings).
```

--------------------------------

### SmithWaterman Initialization

Source: https://string2string.readthedocs.io/en/latest/alignment.html

Initializes the Smith-Waterman algorithm with specified weights and gap character. It can also use a custom match dictionary.

```APIDOC
## SmithWaterman Constructor

### Description
Initializes the class variables of the Smith-Waterman algorithm, used for local alignment of sequences (e.g., strings or lists of strings) such as DNA sequences.

### Parameters
* **match_weight** (int or float) - The weight of a match (default: 1).
* **mismatch_weight** (int or float) - The weight of a mismatch (default: -1).
* **gap_weight** (int or float) - The weight of a gap (default: -1).
* **gap_char** (str) - The character used to represent a gap (default: '-')
* **match_dict** (dict or None) - The dictionary that maps the characters to their match weights (default: None).
```
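
To show how the three weights interact, below is a minimal sketch of the Smith-Waterman scoring recurrence that returns only the best local alignment score. It is a textbook version written for this note, not the library's class; the function name `smith_waterman_score` is hypothetical, and `gap_char` and `match_dict` are left out for brevity.

```python
from typing import List, Union


def smith_waterman_score(
    str1: Union[str, List[str]],
    str2: Union[str, List[str]],
    match_weight: float = 1,
    mismatch_weight: float = -1,
    gap_weight: float = -1,
) -> float:
    """Return the best local alignment score between str1 and str2."""
    n, m = len(str1), len(str2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = match_weight if str1[i - 1] == str2[j - 1] else mismatch_weight
            score[i][j] = max(
                0,                                  # start a new local alignment here
                score[i - 1][j - 1] + substitution,  # match or mismatch
                score[i - 1][j] + gap_weight,        # gap in str2
                score[i][j - 1] + gap_weight,        # gap in str1
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("xxGATTACAyy", "zzGATTACAww"))  # 7
```

The floor at 0 is what makes the alignment local: unrelated flanking characters cannot drag the score of the shared region below zero.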

--------------------------------

### Get Alignment with Hirschberg Algorithm

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Computes and returns the global alignment of two strings or lists of strings using the Hirschberg algorithm.

```python
def get_alignment(self,
        str1: Union[str, List[str]],
        str2: Union[str, List[str]],
    ) -> Tuple[Union[str, List[str]], Union[str, List[str]]]:
        r"""
        This function gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.

        Arguments:
            str1: The first string (or list of strings).
            str2: The second string (or list of strings).

        Returns:
            The aligned strings as a tuple of two strings (or list of strings).
        """
```
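
The signature above does not show how the Hirschberg algorithm saves memory, so here is a minimal sketch of its two core pieces: computing only the last row of the global alignment score table, and using a forward and a reversed pass to pick where to split the second string. This is a textbook formulation, not the library's code; the names `nw_last_row` and `best_split` and the default weights are assumptions.

```python
from typing import List, Sequence


def nw_last_row(a: Sequence, b: Sequence,
                match: float = 1, mismatch: float = -1, gap: float = -1) -> List[float]:
    """Last row of the global alignment score table, keeping two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + substitution, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def best_split(a: Sequence, b: Sequence) -> int:
    """Index j such that a[:mid] aligns with b[:j] and a[mid:] with b[j:] optimally."""
    mid = len(a) // 2
    left = nw_last_row(a[:mid], b)
    right = nw_last_row(a[mid:][::-1], b[::-1])
    totals = [left[j] + right[len(b) - j] for j in range(len(b) + 1)]
    return max(range(len(totals)), key=totals.__getitem__)


print(best_split("abcd", "abcd"))  # 2
```

A full Hirschberg alignment would recurse on the two halves given by `best_split` and concatenate the results, which is what keeps the space requirement linear.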

--------------------------------

### Initialize and Index Dataset

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the dataset, generates embeddings, and adds a FAISS index for efficient semantic search. This method processes the input corpus, generates embeddings for a specified section, and optionally saves the processed dataset.

```APIDOC
## add_faiss_index

### Description
This function adds a FAISS index to the dataset for efficient semantic search.

### Arguments
- **dataset_dict** (Dict[str, List[str]]): The dataset dictionary.
- **section** (str): The section of the dataset to use whose embeddings will be used for semantic search (e.g., 'text', 'title') (default: 'text').
- **index_column_name** (str): The name of the column containing the embeddings (default: 'embeddings').
- **embedding_type** (str): The type of embedding to use (default: 'last_hidden_state').
- **batch_size** (int, optional): The batch size to use (default: 8).
- **max_length** (int, optional): The maximum length of the input sequences.
- **num_workers** (int, optional): The number of workers to use.
- **save_path** (Optional[str], optional): The path to save the dataset (default: None).

### Returns
- Dataset: The dataset object (HuggingFace Datasets).

### Raises
- ValueError: If the dataset is not a dictionary or pandas DataFrame or HuggingFace Datasets object.
```

--------------------------------

### Get Gap Weight for Character

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Returns the predefined gap weight for a given character or string. This is a utility function for alignment scoring.

```python
def get_gap_weight(self,
        c: Union[str, List[str]],
    ) -> float:
        """
        This function returns the gap weight of a character or string.

        Arguments:
            c (str or list of str): The character or string.

        Returns:
            The gap weight of the character or string.
        """
        
```

--------------------------------

### Call Data Preparation for Plotly

Source: https://string2string.readthedocs.io/en/latest/hupd_example.html

Calls the prepare_plotly_data function to get the necessary data structures for creating an interactive plotly visualization.

```python
# Let's prepare the data for plotly
tsne_coords, tsne_labels, tsne_titles, tsne_hover_texts = prepare_plotly_data(
    tsne_embeddings, patent_titles, patent_ipc_subclass_labels, most_common_labels)
```

--------------------------------

### Initialize Dataset and Add FAISS Index

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Initializes the dataset from various formats (dict, DataFrame, Dataset) and adds a FAISS index for efficient semantic search. It maps the specified section to embeddings and optionally saves the processed dataset.

```python
        # Build a HuggingFace Dataset from whichever input type was given
        if isinstance(corpus, dict):
            self.dataset = Dataset.from_dict(corpus)
        elif isinstance(corpus, pd.DataFrame):
            self.dataset = Dataset.from_pandas(corpus)
        elif isinstance(corpus, Dataset):
            self.dataset = corpus
        else:
            raise ValueError('The dataset must be a dictionary or pandas DataFrame.')
        
        # Set the embedding_type
        self.embedding_type = embedding_type
            

        # Tokenize the dataset
        # self.dataset = self.dataset.map(
        #     lambda x: x[section],
        #     batched=True,
        #     batch_size=batch_size,
        #     num_proc=num_workers,
        # )

        # Map the section of the dataset to the embeddings
        self.dataset = self.dataset.map(
            lambda x: {
                index_column_name: self.get_embeddings(x[section], embedding_type=self.embedding_type).detach().cpu().numpy()[0]
                },
            # batched=True,
            batch_size=batch_size,
            num_proc=num_workers,
        )

        # Save the dataset
        if save_path is not None:
            self.dataset.to_json(save_path)

        # Add FAISS index
        self.add_faiss_index(
            column_name=index_column_name,
        )

        # Return the dataset
        return self.dataset
```

--------------------------------

### BoyerMooreSearch Class Initialization

Source: https://string2string.readthedocs.io/en/latest/matching.html

Initializes the Boyer-Moore search algorithm class. This class implements the Boyer-Moore string searching algorithm, known for its efficiency in skipping large sections of text.

```APIDOC
## __init__() 

### Description
Initializes the Boyer-Moore search algorithm class. The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms.

### Parameters

None

### Returns

None
```

--------------------------------

### Get Mean Pooling Embedding

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Computes the mean pooling of the input embeddings. This provides a sentence or document-level embedding by averaging all token embeddings.

```python
mean_pooling = embeddings.last_hidden_state.mean(dim=1)
return mean_pooling
```

--------------------------------

### Import Plotly and NetworkX Libraries

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

Import the necessary libraries for plotting network graphs. This is a prerequisite for the subsequent steps.

```python
# Let's import the specific modules that we will use for plotting a network graph with plotly
import plotly.graph_objects as go
import networkx as nx
```

--------------------------------

### Get Mean Pooling

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/model_embeddings.html

Calculates the mean pooling of the input embeddings. This aggregates the embeddings of all tokens in a sequence into a single vector by taking the average.

```python
def get_mean_pooling(self,
        embeddings: torch.Tensor,
    ) -> torch.Tensor:
        """
        Returns the mean pooling of the input embeddings.

        Arguments:
            embeddings (torch.Tensor): The input embeddings.

        Returns:
            torch.Tensor: The mean pooling.
        """

        # Get the mean pooling
        mean_pooling = embeddings.last_hidden_state.mean(dim=1)

        # Return the mean pooling
        return mean_pooling

```
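
The difference between first-token selection (`last_hidden_state[:, 0, :]`) and mean pooling (`last_hidden_state.mean(dim=1)`) can be seen on a tiny example. The sketch below uses nested Python lists shaped (batch, sequence, hidden) in place of a `torch.Tensor` so it runs without PyTorch; the helper names `first_token` and `mean_pooling` are assumptions for this illustration.

```python
from typing import List

Hidden = List[List[List[float]]]  # shape: (batch, sequence, hidden)


def first_token(last_hidden_state: Hidden) -> List[List[float]]:
    """Plain-list equivalent of last_hidden_state[:, 0, :]."""
    return [sequence[0] for sequence in last_hidden_state]


def mean_pooling(last_hidden_state: Hidden) -> List[List[float]]:
    """Plain-list equivalent of last_hidden_state.mean(dim=1)."""
    pooled = []
    for sequence in last_hidden_state:
        hidden_size = len(sequence[0])
        pooled.append([
            sum(token[d] for token in sequence) / len(sequence)
            for d in range(hidden_size)
        ])
    return pooled


batch = [[[1.0, 2.0], [3.0, 4.0]]]  # one sequence, two tokens, hidden size 2
print(first_token(batch))   # [[1.0, 2.0]]
print(mean_pooling(batch))  # [[2.0, 3.0]]
```

Note that this plain average, like the listing above, counts every position in the sequence, including any padding tokens.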

--------------------------------

### initialize_lps() Method

Source: https://string2string.readthedocs.io/en/latest/matching.html

Initializes the longest proper prefix suffix (lps) array. This array is crucial for the KMP algorithm's efficiency by helping to avoid redundant comparisons.

```APIDOC
## initialize_lps()

### Description
Initializes the longest proper prefix suffix (lps) array, which contains the length of the longest proper prefix that is also a suffix of the pattern.

### Parameters
* **pattern** (_str_) – The pattern to search for.

### Returns
None
```
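
To make the meaning of the lps values concrete, the sketch below computes them straight from the definition, checking every proper prefix of every pattern prefix. It is deliberately brute force and is only meant for checking small examples; the function name `lps_by_definition` is hypothetical and this is not how the library fills the array.

```python
from typing import List


def lps_by_definition(pattern: str) -> List[int]:
    """lps[k] = length of the longest proper prefix of pattern[:k+1] that is also its suffix."""
    lps = []
    for end in range(1, len(pattern) + 1):
        prefix = pattern[:end]
        best = 0
        # A proper prefix is strictly shorter than the string itself.
        for length in range(1, end):
            if prefix[:length] == prefix[end - length:]:
                best = length
        lps.append(best)
    return lps


print(lps_by_definition("abcab"))  # [0, 0, 0, 1, 2]
print(lps_by_definition("aaaa"))   # [0, 1, 2, 3]
```

For `"abcab"`, the final value 2 says that after matching all five characters, a mismatch can resume with the first two pattern characters already known to match.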

--------------------------------

### Reset Polynomial Rolling Hash

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/hash_functions.html

Resets the current hash value to zero. This is useful when starting a new computation or re-initializing the hash state.

```python
def reset(self) -> None:
        """
        Resets the hash value.

        Arguments:
            None

        Returns:
            None
        """
        # Reset the current hash value
        self.current_hash = 0
```
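
For context on why a `reset` method is useful, here is a minimal sketch of a polynomial rolling hash that keeps its running value in `current_hash`. The class name `PolynomialRollingHashSketch`, the `compute` method, and the default `base` and `modulus` values are all assumptions for illustration and may differ from the library's class.

```python
class PolynomialRollingHashSketch:
    """Toy polynomial hash: h = (c0 * b^(n-1) + c1 * b^(n-2) + ... + c(n-1)) mod m."""

    def __init__(self, base: int = 31, modulus: int = 10**9 + 7) -> None:
        self.base = base
        self.modulus = modulus
        self.current_hash = 0

    def compute(self, text: str) -> int:
        # Start from a clean state so earlier calls do not leak into this one.
        self.reset()
        for ch in text:
            self.current_hash = (self.current_hash * self.base + ord(ch)) % self.modulus
        return self.current_hash

    def reset(self) -> None:
        # Reset the current hash value
        self.current_hash = 0


hasher = PolynomialRollingHashSketch(base=10, modulus=1000)
print(hasher.compute("ab"))  # 68, i.e. (97 * 10 + 98) % 1000
hasher.reset()
print(hasher.current_hash)   # 0
```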

--------------------------------

### initialize_corpus

Source: https://string2string.readthedocs.io/en/latest/matching.html

Initializes a dataset for semantic search from various input formats like dictionaries, pandas DataFrames, or HuggingFace Datasets.

```APIDOC
## initialize_corpus

### Description
Initializes a dataset using a dictionary or pandas DataFrame or HuggingFace Datasets object.

### Parameters
* **dataset_dict** (Dict[str, List[str]]) - The dataset dictionary.
* **section** (str) - The section of the dataset to use whose embeddings will be used for semantic search (e.g., ‘text’, ‘title’, etc.) (default: ‘text’).
* **index_column_name** (str) - The name of the column containing the embeddings (default: ‘embeddings’)
* **embedding_type** (str) - The type of embedding to use (default: ‘last_hidden_state’).
* **batch_size** (int, optional) - The batch size to use (default: 8).
* **max_length** (int, optional) - The maximum length of the input sequences.
* **num_workers** (int, optional) - The number of workers to use.
* **save_path** (Optional[str], optional) - The path to save the dataset (default: None).

### Returns
The dataset object (HuggingFace Datasets).

### Return type
Dataset

### Raises
* **ValueError** - If the dataset is not a dictionary or pandas DataFrame or HuggingFace Datasets object.
```
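
As a rough mental model of what the indexed dataset enables, the sketch below performs a brute-force nearest-neighbor lookup over precomputed embeddings using Euclidean distance. It does not use FAISS, HuggingFace Datasets, or any model; the function `brute_force_search`, the hand-written embeddings, and the choice of distance are assumptions for illustration, with the column names borrowed from the defaults listed above.

```python
import math
from typing import Dict, List, Tuple


def brute_force_search(
    corpus: Dict[str, List],
    query_embedding: List[float],
    section: str = 'text',
    index_column_name: str = 'embeddings',
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k corpus entries whose embeddings are closest to the query."""
    scored = [
        (text, math.dist(embedding, query_embedding))
        for text, embedding in zip(corpus[section], corpus[index_column_name])
    ]
    # Smaller distance means a closer match.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


corpus = {
    'text': ['a patent about batteries', 'a patent about antennas'],
    'embeddings': [[0.0, 0.0], [1.0, 1.0]],  # hand-written stand-ins for model output
}
print(brute_force_search(corpus, [0.9, 1.0])[0][0])  # a patent about antennas
```

An index such as FAISS answers the same kind of query without scanning and sorting every entry in Python, which matters once the corpus is large.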

--------------------------------

### compute_multi_ref_score

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/similarity/bartscore.html

Scores a batch of examples where each source sentence can have multiple target references. It supports different aggregation methods like 'mean' or 'max'.

```APIDOC
## compute_multi_ref_score

### Description
Scores a batch of examples with multiple references. This method is used when each source sentence is associated with a list of target sentences (references). It supports aggregation of scores from multiple references.

### Parameters
#### Arguments
- **source_sentences** (List[str]): The source sentences.
- **target_sentences** (List[List[str]]): The target sentences, where each element is a list of reference sentences for the corresponding source sentence.
- **batch_size** (int): The batch size to use for processing. Defaults to 4.
- **agg** (str): The aggregation method. Can be 'mean' or 'max'. Defaults to 'mean'.

### Returns
- **Dict[str, List[float]]**: A dictionary containing the aggregated BARTScore for each example.

### Raises
- **ValueError**: If the number of source sentences and target sentences do not match.
- **Exception**: If the number of references per sample is inconsistent.
```
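
The aggregation step can be sketched independently of the scoring model. In the example below, the per-reference scores are hand-written placeholders rather than real BARTScore values, and the function name `aggregate_multi_ref`, the `'score'` dictionary key, and the use of `ValueError` for inconsistent reference counts are assumptions made for illustration.

```python
from typing import Dict, List


def aggregate_multi_ref(
    scores_per_reference: List[List[float]],
    agg: str = 'mean',
) -> Dict[str, List[float]]:
    """Collapse the scores of each example's references into one score per example."""
    # Every example is expected to have the same number of references.
    if len({len(scores) for scores in scores_per_reference}) > 1:
        raise ValueError('The number of references per sample is inconsistent.')

    if agg == 'mean':
        aggregated = [sum(scores) / len(scores) for scores in scores_per_reference]
    elif agg == 'max':
        aggregated = [max(scores) for scores in scores_per_reference]
    else:
        raise ValueError("agg must be 'mean' or 'max'.")

    return {'score': aggregated}


# Placeholder scores: two examples, two references each.
scores = [[-1.0, -3.0], [-2.0, -2.0]]
print(aggregate_multi_ref(scores, agg='mean'))  # {'score': [-2.0, -2.0]}
print(aggregate_multi_ref(scores, agg='max'))   # {'score': [-1.0, -2.0]}
```

Choosing `'max'` rewards a source for matching any one of its references well, while `'mean'` rewards agreement with all of them.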

--------------------------------

### Smith-Waterman Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Initializes the Smith-Waterman algorithm class for local sequence alignment. Configurable with match, mismatch, and gap weights, and supports a custom match dictionary.

```python
def __init__(self,
        match_weight: Union[int, float] = 1,
        mismatch_weight: Union[int, float] = -1,
        gap_weight: Union[int, float] = -1,
        gap_char: str = '-',
        match_dict: dict = None,
    ) -> None:
        r"""
        This function initializes the class variables of the Smith-Waterman algorithm, used for local alignment of sequences (e.g., strings or lists of strings) such as DNA sequences.
        """
```

--------------------------------

### BoyerMooreSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a given pattern within a text using the Boyer-Moore algorithm. Returns the starting index of the first occurrence of the pattern or -1 if not found.

```APIDOC
## search(pattern: str, text: str) -> int

### Description
This function searches for the pattern in the text using the Boyer-Moore algorithm.

### Parameters

* **pattern** (str) - Required - The pattern to search for.
* **text** (str) - Required - The text to search in.

### Returns

The index of the pattern in the text (or -1 if the pattern is not found).

### Return Type

int

### Raises

**AssertionError** – If the text or the pattern is not a string.
```
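
To illustrate the skipping idea, here is a simplified sketch that uses only the bad-character rule of Boyer-Moore and omits the good-suffix rule, so it is not the full algorithm and not the library's implementation. The function name `bad_character_search` and the empty-pattern behavior are assumptions.

```python
def bad_character_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    assert isinstance(pattern, str), 'The pattern must be a string.'
    assert isinstance(text, str), 'The text must be a string.'
    m, n = len(pattern), len(text)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0

    # Rightmost position of each character in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}

    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            return shift
        # Align the mismatched text character with its rightmost occurrence
        # in the pattern, always moving forward by at least one position.
        shift += max(1, j - last.get(text[shift + j], -1))
    return -1


print(bad_character_search("abcab", "xxabcabyy"))  # 2
print(bad_character_search("zz", "abc"))           # -1
```

In the first call, the mismatch on `'c'` lets the search jump two positions at once instead of sliding by one.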

--------------------------------

### Boyer-Moore Search Algorithm Initialization

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/classical.html

Initializes the Boyer-Moore search algorithm class. This class is designed for efficient string searching using heuristics to skip sections of text.

```python
    def __init__(self) -> None:
        """
        This function initializes the Boyer-Moore search algorithm class. [BM1977]_

        The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms such as brute-force or Knuth-Morris-Pratt. It is particularly useful for searching for patterns in large amounts of text.

        .. [BM1977] Boyer, RS and Moore, JS. "A fast string searching algorithm." Communications of the ACM 20.10 (1977): 762-772.

        A Correct Preprocessing Algorithm for Boyer–Moore String-Searching

        https://www.cs.jhu.edu/~langmea/resources/lecture_notes/strings_matching_boyer_moore.pdf

        """
        super().__init__()
```

--------------------------------

### Initialize Tokenizer

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/misc/default_tokenizer.html

Initializes the Tokenizer with a custom word delimiter. Defaults to a space if not specified.

```python
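from string2string.misc.default_tokenizer import Tokenizer

# Split words on a single space (the default delimiter)
tokenizer = Tokenizer(word_delimiter=" ")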
```

--------------------------------

### RabinKarpSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a pattern within a given text using the Rabin-Karp algorithm. Returns the starting index of the first occurrence of the pattern, or -1 if not found.

```APIDOC
## RabinKarpSearch.search

### Description
Searches for the pattern in the text using the Rabin-Karp algorithm.

### Parameters
- **pattern** (str) - The pattern to search for.
- **text** (str) - The text to search in.

### Returns
- **int** - The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
- **AssertionError** – If the inputs are invalid.
```
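
As a sketch of the technique named above, the function below implements a basic Rabin-Karp search with a polynomial rolling hash and a direct string comparison to rule out hash collisions. It is a textbook version, not the library's class; the function name `rabin_karp_search` and the `base` and `modulus` defaults are assumptions.

```python
def rabin_karp_search(pattern: str, text: str,
                      base: int = 256, modulus: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    assert isinstance(pattern, str), 'The pattern must be a string.'
    assert isinstance(text, str), 'The text must be a string.'
    m, n = len(pattern), len(text)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0
    if m > n:
        return -1

    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, modulus)

    pattern_hash = window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        window_hash = (window_hash * base + ord(text[k])) % modulus

    for start in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a real comparison.
        if pattern_hash == window_hash and text[start:start + m] == pattern:
            return start
        if start < n - m:
            # Drop text[start], shift the window, and add text[start + m].
            window_hash = ((window_hash - ord(text[start]) * high) * base
                           + ord(text[start + m])) % modulus
    return -1


print(rabin_karp_search("abcab", "xxabcabyy"))  # 2
print(rabin_karp_search("zz", "abc"))           # -1
```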

--------------------------------

### StringAlignment Class Initialization

Source: https://string2string.readthedocs.io/en/latest/alignment.html

Initializes the StringAlignment class with customizable weights and gap characters.

```APIDOC
## __init__

### Description
This function initializes the StringAlignment class.

### Parameters
* **match_weight** (int) - The weight for a match (default: 1).
* **mismatch_weight** (int) - The weight for a mismatch (default: -1).
* **gap_weight** (int) - The weight for a gap (default: -1).
* **gap_char** (str) - The character for a gap (default: "-").
* **match_dict** (dict | None) - The match dictionary (default: None).

### Returns
None

### Note
The match_dict represents a dictionary of the match weights for each pair of characters. For example, if the match_dict is {"A": {"A": 1, "T": -1}, "T": {"A": -1, "T": 1}}, then the match weight for "A" and "A" is 1, the match weight for "A" and "T" is -1, the match weight for "T" and "A" is -1, and the match weight for "T" and "T" is 1. The match_dict is particularly useful when we wish to align (or match) non-identical characters. For example, if we wish to align "A" and "T", we can set the match_dict to {"A": {"T": 1}}. This will ensure that the match weight for "A" and "T" is 1, and the match weight for "A" and "A" and "T" and "T" is 0.
```

--------------------------------

### Get Match Weight for Characters

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

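Retrieves the scoring weight for a pair of characters. If a match dictionary is defined and contains the ordered pair `(c1, c2)`, that entry is returned; otherwise the method falls back to `match_weight` for identical characters and `mismatch_weight` for different ones.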

```python
def get_match_weight(self, 
        c1: Union[str, List[str]],
        c2: Union[str, List[str]],
    ) -> float:
        """
        This function returns the match weight of two characters.

        Arguments:
            c1 (str or list of str): The first character or string.
            c2 (str or list of str): The second character or string.

        Returns:
            The match weight of the two characters or strings.
        """

        # If there is no match dictionary, return the match weight if the characters are the same, and the mismatch weight otherwise.
        if self.match_dict is None:
            if c1 == c2:
                return self.match_weight
            return self.mismatch_weight
        # Otherwise, return the match weight according to the match dictionary.
        else:
            if c1 in self.match_dict and c2 in self.match_dict[c1]:
                return self.match_dict[c1][c2]
            else:
                if c1 == c2:
                    return self.match_weight
                return self.mismatch_weight
```
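The lookup order above can be tried out in isolation. The following is a small standalone sketch that mirrors the method's logic as a plain function; the name `lookup_match_weight` and its keyword arguments are made up for this example and are not part of the library.

```python
from typing import Dict, Optional


def lookup_match_weight(c1: str, c2: str,
                        match_dict: Optional[Dict[str, Dict[str, float]]] = None,
                        match_weight: float = 1.0,
                        mismatch_weight: float = -1.0) -> float:
    """Standalone mirror of the lookup-then-fallback logic shown above."""
    # Use the dictionary entry when the ordered pair (c1, c2) is present.
    if match_dict is not None and c1 in match_dict and c2 in match_dict[c1]:
        return match_dict[c1][c2]
    # Otherwise fall back to the plain match/mismatch weights.
    return match_weight if c1 == c2 else mismatch_weight


pairing = {"A": {"T": 1.0}}
print(lookup_match_weight("A", "T", pairing))  # 1.0  (found in the dictionary)
print(lookup_match_weight("T", "A", pairing))  # -1.0 (reverse pair is absent)
print(lookup_match_weight("A", "A", pairing))  # 1.0  (fallback: identical)
```

Note that the lookup is directional: an entry for `("A", "T")` does not cover `("T", "A")`, so a symmetric scoring scheme needs both entries.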

--------------------------------

### get_alignment

Source: https://string2string.readthedocs.io/en/latest/alignment.html

Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm. This method combines divide-and-conquer and dynamic programming principles for a space-efficient solution.

```APIDOC
## get_alignment

### Description
Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.

### Parameters
* **str1** (str or List[str]) - The first string (or list of strings).
* **str2** (str or List[str]) - The second string (or list of strings).

### Returns
* Tuple[str or List[str], str or List[str]] - The aligned strings as a tuple of two strings (or list of strings).
```
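The space saving in Hirschberg's algorithm comes from computing only the last row of the Needleman-Wunsch score matrix, which needs two rows of memory, and using it to pick a split point for the recursion. The sketch below shows just these two building blocks, not the full alignment and not the library's code. The helper names `nw_last_row` and `best_split` and the fixed integer weights are assumptions for illustration.

```python
from typing import List


def nw_last_row(str1: str, str2: str,
                match: int = 1, mismatch: int = -1, gap: int = -1) -> List[int]:
    """Last row of the global-alignment score matrix, keeping two rows in memory."""
    previous = [j * gap for j in range(len(str2) + 1)]
    for i in range(1, len(str1) + 1):
        current = [i * gap] + [0] * len(str2)
        for j in range(1, len(str2) + 1):
            pair = match if str1[i - 1] == str2[j - 1] else mismatch
            current[j] = max(previous[j - 1] + pair,  # align the two characters
                             previous[j] + gap,       # gap in str2
                             current[j - 1] + gap)    # gap in str1
        previous = current
    return previous


def best_split(str1: str, str2: str) -> int:
    """Index at which to cut str2 when str1 is cut in the middle."""
    mid = len(str1) // 2
    # Score the first half forwards and the second half backwards.
    left = nw_last_row(str1[:mid], str2)
    right = nw_last_row(str1[mid:][::-1], str2[::-1])
    totals = [left[j] + right[len(str2) - j] for j in range(len(str2) + 1)]
    return max(range(len(totals)), key=totals.__getitem__)


print(nw_last_row("abc", "abc"))   # [-3, -1, 1, 3]
print(best_split("abcd", "abcd"))  # 2
```

A full implementation would recurse on the two halves produced by the split and fall back to a direct alignment once one of the strings has length 0 or 1.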

--------------------------------

### Initialize Faiss Semantic Search Tool

Source: https://string2string.readthedocs.io/en/latest/hupd_example.html

Initializes the FaissSearch tool with the HUPD DistilRoBERTa model (`HUPD/hupd-distilroberta-base`), which is then used to embed texts for semantic search.

```python
from string2string.search.faiss_search import FaissSearch

# Let's download the HUPD DistilRoBERTa model
model_name = 'HUPD/hupd-distilroberta-base'
faiss_search = FaissSearch(model_name_or_path = model_name)
```

--------------------------------

### Smith-Waterman Backtracking Function

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Overrides the Needleman-Wunsch backtrack function to perform local alignment. It starts from the highest score in the matrix and traces back until a score of zero is encountered.

```python
def backtrack(self,
        score_matrix: np.ndarray,
        str1: Union[str, List[str]],
        str2: Union[str, List[str]],
    ) -> Tuple[Union[str, List[str]], Union[str, List[str]]]:
        """
        This function overrides the backtrack function of the NeedlemanWunsch class to get an optimal local alignment between two strings (or list of strings).

        Arguments:
            score_matrix (numpy.ndarray): The score matrix.
            str1 (str or list of str): The first string (or list of strings).
            str2 (str or list of str): The second string (or list of strings).

        Returns:
            The aligned substrings as a tuple of two strings (or list of strings).

        .. note::
            * The backtrack function used in this function is different from the backtrack function used in the Needleman-Wunsch algorithm. Here we start from the position with the highest score in the score matrix and trace back to the first position that has a score of zero. This is because the highest-scoring subsequence may not necessarily span the entire length of the sequences being aligned.
            * On the other hand, the backtrack function used in the Needleman-Wunsch algorithm traces back through the entire score matrix, starting from the bottom-right corner, to determine the optimal alignment path. This is because the algorithm seeks to find the global alignment of two sequences, which means aligning them from the beginning to the end.
        """

        # Initialize the aligned substrings.
        aligned_str1 = ""
        aligned_str2 = ""

        # Get the position with the maximum score in the score matrix.
        # TODO(msuzgun): See if there is a faster way to get the position with the maximum score in the score matrix.
        i, j = np.unravel_index(np.argmax(score_matrix, axis=None), score_matrix.shape)

        # Backtrack the score matrix.
        while score_matrix[i, j] != 0:
            # Get the scores of the three possible paths.
            match_score = score_matrix[i - 1, j - 1] + self.get_match_weight(str1[i - 1], str2[j - 1])
            delete_score = score_matrix[i - 1, j] + self.get_gap_weight(str1[i - 1])
            insert_score = score_matrix[i, j - 1] + self.get_gap_weight(str2[j - 1])

            # Get the maximum score.
            max_score = max(match_score, delete_score, insert_score)

            # Backtrack the score matrix.
            if max_score == match_score:
                insert_str1, insert_str2 = self.add_space_to_shorter(str1[i - 1], str2[j - 1])
                i -= 1
                j -= 1
            elif max_score == delete_score:
                insert_str1, insert_str2 = self.add_space_to_shorter(str1[i - 1], self.gap_char)
                i -= 1
            elif max_score == insert_score:
                insert_str1, insert_str2 = self.add_space_to_shorter(self.gap_char, str2[j - 1])
                j -= 1

            # NOTE: the excerpt is cut off here. The remaining lines are an assumed
            # completion for plain string inputs, not verbatim library source:
            # prepend this step's characters to the running alignment, then
            # return once a zero-score cell is reached.
            aligned_str1 = insert_str1 + aligned_str1
            aligned_str2 = insert_str2 + aligned_str2

        return aligned_str1, aligned_str2

```
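For readers who want to run the local-alignment procedure end to end, here is a compact, dependency-free sketch of Smith-Waterman on plain strings. It follows the same idea as the method above (start at the highest-scoring cell, stop at the first zero) but is not the library's code. The function name `smith_waterman` and the fixed integer weights are assumptions for this example.

```python
from typing import Tuple


def smith_waterman(str1: str, str2: str, match: int = 1, mismatch: int = -1,
                   gap: int = -1, gap_char: str = "-") -> Tuple[str, str]:
    """Return one optimal local alignment of str1 and str2."""
    rows, cols = len(str1) + 1, len(str2) + 1

    # Fill the score matrix; clamping at zero keeps row 0 and column 0 at zero.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if str1[i - 1] == str2[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Backtrack from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    aligned1, aligned2 = "", ""
    while score[i][j] != 0:
        pair = match if str1[i - 1] == str2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            aligned1, aligned2 = str1[i - 1] + aligned1, str2[j - 1] + aligned2
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned1, aligned2 = str1[i - 1] + aligned1, gap_char + aligned2
            i -= 1
        else:
            aligned1, aligned2 = gap_char + aligned1, str2[j - 1] + aligned2
            j -= 1
    return aligned1, aligned2


print(smith_waterman("xxhelloyy", "zzhellozz"))  # ('hello', 'hello')
```

Because every non-zero cell lies at `i >= 1` and `j >= 1`, the backtracking loop never indexes outside the matrix.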

--------------------------------

### FastTextEmbeddings Initialization

Source: https://string2string.readthedocs.io/en/latest/embedding.html

Initializes the FastTextEmbeddings class with a specified model, download behavior, and directory.

```APIDOC
## FastTextEmbeddings(model: str = 'cc.en.300.bin', force_download: bool = True, dir: str | None = None)

### Description
This function initializes the FastTextEmbeddings class.

### Parameters
* **model** (str) - The model to use. Available models include 'cc.en.300.bin', 'wiki.en', etc.
* **force_download** (bool) - Whether to force the download of the model. Default: True, per the signature above.
* **dir** (str | None) - The directory to save and load the model.

### Raises
* **ValueError** - If the given model is not available.
```

--------------------------------

### Performing a Semantic Search Query

Source: https://string2string.readthedocs.io/en/latest/plagiarism_detection.html

This snippet defines a query string (a short essay) and runs a semantic search over the indexed corpus with the `search` method of the `faiss_search` object. It retrieves the 15 results that are most similar to the query.

```python
# Here is the query we want to search for
query = r"""Title: The Depiction of Love and Marriage in Jane Austen and William Shakespeare's Works

Both Jane Austen and William Shakespeare present complex and nuanced depictions of love and marriage in their works, highlighting the social conventions and power dynamics that shape courtship and marital relationships. However, while Austen's novels emphasize the importance of compatibility, mutual respect, and shared values in successful marriages, Shakespeare's plays often explore the tragic consequences of love that is driven by passion and societal pressure rather than genuine affection.

Jane Austen's novels, such as Pride and Prejudice and Sense and Sensibility, are renowned for their sharp critique of the social norms and gender roles that govern courtship and marriage in the Regency era. Austen challenges the prevailing notion that marriage is primarily a means of securing financial stability and social status, portraying characters who seek genuine emotional and intellectual connections with their partners. For instance, in Pride and Prejudice, Elizabeth Bennet resists her mother's pressure to marry a wealthy and titled man and instead falls in love with Mr. Darcy, a man whom she initially dislikes due to his pride and aloofness. Their relationship evolves through a series of misunderstandings and self-reflection, leading to a mutual recognition of their faults and virtues. Similarly, in Sense and Sensibility, the Dashwood sisters navigate the challenges of romantic attachment and social expectations, ultimately finding happiness with men who share their values and interests.

In contrast, William Shakespeare's plays, such as Romeo and Juliet and Othello, often depict love and marriage as tragic and fraught with conflict. Shakespeare portrays characters who are driven by intense emotions and societal pressures, leading them to make rash decisions that result in ruin and despair. For example, in Romeo and Juliet, the young lovers defy their feuding families and elope, but their passion is ultimately their downfall, as their families' enmity leads to a series of tragic events that culminate in their deaths. Similarly, in Othello, the titular character's jealousy and mistrust of his wife, Desdemona, ultimately leads to her murder, highlighting the destructive power of toxic masculinity and insecurity.

Furthermore, while Austen's novels emphasize the importance of compatibility and shared values in successful marriages, Shakespeare's plays often depict marriages that are fraught with power imbalances and emotional distance. For instance, in The Taming of the Shrew, Petruchio marries Katherine, a headstrong and independent woman, and seeks to "tame" her into submission through verbal and physical abuse. Although the play is often interpreted as a satire of gender norms, it nevertheless reinforces the patriarchal notion that women must be subordinated to men in marriage. Similarly, in Macbeth, the titular character's ambition and thirst for power lead him to murder his king and estrange himself from his wife, Lady Macbeth, who ultimately succumbs to guilt and madness.

In conclusion, Jane Austen and William Shakespeare offer contrasting depictions of love and marriage in their works, reflecting the social and cultural norms of their respective eras. Austen's novels emphasize the importance of emotional and intellectual compatibility in successful marriages, challenging the notion that marriage is primarily a financial transaction. In contrast, Shakespeare's plays often explore the tragic consequences of love that is driven by passion and societal pressure rather than genuine affection. While both Austen and Shakespeare offer complex and nuanced depictions of love and marriage, their works ultimately serve as a commentary on the enduring human desire for connection and companionship.

Word Count: 797
"""

# Let's get the top 15 results
results = faiss_search.search(query, k=15)

```

--------------------------------

### NaiveSearch.search

Source: https://string2string.readthedocs.io/en/latest/matching.html

Searches for a pattern within a given text using the naive (brute force) approach. Returns the starting index of the first occurrence of the pattern, or -1 if not found.

```APIDOC
## NaiveSearch.search

### Description
Searches for the pattern in the text using a brute-force method.

### Returns
- **return value** (int) - The index of the pattern in the text (or -1 if the pattern is not found).

### Raises
- **AssertionError** – If the inputs are invalid.
```
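As a point of reference, the brute-force approach can be sketched in a few lines. This is an illustrative standalone function, not the library's implementation; the name `naive_search` and its handling of an empty pattern are assumptions.

```python
def naive_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Try every possible starting position and compare the window directly.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1


print(naive_search("lo", "hello"))   # 3
print(naive_search("xyz", "hello"))  # -1
```

Each of the `n - m + 1` windows may need up to `m` character comparisons, so the worst case is O(n·m) time. The other search classes in the library trade extra bookkeeping for fewer comparisons.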

--------------------------------

### Initialize StringAlignment Class

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Initializes the StringAlignment class with configurable weights for matches, mismatches, and gaps. Supports a custom match dictionary for specific character pair scoring.

```python
class StringAlignment:
    def __init__(self,
        match_weight: int = 1.,
        mismatch_weight: int = -1.,
        gap_weight: int = -1,
        gap_char: str = "-",
        match_dict: dict = None,
        ) -> None:
        r"""
        This function initializes the StringAlignment class.

        Arguments:
            match_weight (int): The weight for a match (default: 1).
            mismatch_weight (int): The weight for a mismatch (default: -1).
            gap_weight (int): The weight for a gap (default: -1).
            gap_char (str): The character for a gap (default: "-").
            match_dict (dict): The match dictionary (default: None).

        Returns:
            None

        .. note::

            The match_dict represents a dictionary of the match weights for each ordered pair of characters. For example, if the match_dict is {"A": {"A": 1, "T": -1}, "T": {"A": -1, "T": 1}}, then the match weight for "A" and "A" is 1, the match weight for "A" and "T" is -1, the match weight for "T" and "A" is -1, and the match weight for "T" and "T" is 1.
            The match_dict is particularly useful when we wish to align (or match) non-identical characters. For example, if we wish to align "A" and "T", we can set the match_dict to {"A": {"T": 1}}. This will ensure that the match weight for "A" and "T" is 1. Pairs that are missing from the dictionary, here ("T", "A"), ("A", "A"), and ("T", "T"), fall back to match_weight when the characters are identical and to mismatch_weight otherwise.
        """
        # Set the weights.
        self.match_weight = match_weight
        self.mismatch_weight = mismatch_weight
        self.gap_weight = gap_weight
        self.gap_char = gap_char
        self.match_dict = match_dict
```

--------------------------------

### Initialize ROUGE Wrapper

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/metrics/rouge.html

Initializes the ROUGE class, optionally with a custom tokenizer. If no tokenizer is provided, a default one with a space delimiter is used.

```python
from string2string.metrics.rouge import ROUGE
from string2string.misc.default_tokenizer import Tokenizer

# Initialize with default tokenizer
rouge_wrapper = ROUGE()

# Initialize with a custom tokenizer
custom_tokenizer = Tokenizer(word_delimiter='\t')
rouge_wrapper_custom = ROUGE(tokenizer=custom_tokenizer)
```

--------------------------------

### Get Character Pair Score

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/alignment/classical.html

Returns the score for a pair of characters: the match weight if they are identical, the gap weight if one of them is the gap character, and the mismatch weight otherwise. Because equality is checked first, a pair of two gap characters is scored as a match.

```python
def get_score(self,
        c1: Union[str, List[str]],
        c2: Union[str, List[str]],
    ) -> float:
        """
        This function returns the score of a character or string pair.

        Arguments:
            c1 (str or list of str): The first character or string.
            c2 (str or list of str): The second character or string.

        Returns:
            The score of the character or string pair.
        """
        # If the characters are the same, return the match weight.
        if c1 == c2:
            return self.match_weight
        # If one of the characters is a gap, return the gap weight.
        elif c1 == self.gap_char or c2 == self.gap_char:
            return self.gap_weight
        # Otherwise, return the mismatch weight.
        else:
            return self.mismatch_weight
```
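One natural use of a per-pair score like this is to total the score of an alignment that has already been computed, column by column. The sketch below does that with a standalone copy of the same three-way rule. The function name `score_aligned_pair` and its keyword arguments are hypothetical and are not part of the library.

```python
def score_aligned_pair(aligned1: str, aligned2: str,
                       match_weight: float = 1.0,
                       mismatch_weight: float = -1.0,
                       gap_weight: float = -1.0,
                       gap_char: str = "-") -> float:
    """Sum the column scores of two aligned strings of equal length."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned strings must have the same length")

    def column_score(c1: str, c2: str) -> float:
        # Same order of checks as get_score: equality, then gap, then mismatch.
        if c1 == c2:
            return match_weight
        if c1 == gap_char or c2 == gap_char:
            return gap_weight
        return mismatch_weight

    return sum(column_score(c1, c2) for c1, c2 in zip(aligned1, aligned2))


# Six matches, one mismatch (A/C), and one gap column: 6 - 1 - 1 = 4.
print(score_aligned_pair("GATT-ACA", "GCTTAACA"))  # 4.0
```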

--------------------------------

### Get Last Hidden State Embedding

Source: https://string2string.readthedocs.io/en/latest/_modules/string2string/search/faiss_search.html

Takes the model's last hidden state and returns the vector at the first token position of each sequence (the [CLS] token for BERT-style models). This vector serves as a sentence-level embedding.

```python
last_hidden_state = embeddings.last_hidden_state
return last_hidden_state[:, 0, :]
```
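The slice `[:, 0, :]` is easier to follow on a toy array. The example below uses NumPy instead of the model's actual tensors, and the shape (2 sequences, 3 tokens, hidden size 4) is an arbitrary choice for illustration; the indexing semantics are the same.

```python
import numpy as np

# Toy "last hidden state" with shape (batch, sequence length, hidden size).
last_hidden_state = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

# Keep every batch item, only token position 0, and every hidden dimension.
first_token_embeddings = last_hidden_state[:, 0, :]

print(first_token_embeddings.shape)  # (2, 4)
print(first_token_embeddings[1])     # [12. 13. 14. 15.]
```

The result has one row per input sequence, which is the shape expected when each text is represented by a single vector.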