### Install string2string Library

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/index.md

Install the string2string library using pip. Run this command in your terminal.

```bash
pip install string2string
```
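
After installing, it can help to confirm that the package is importable from the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and is not part of string2string.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "string2string" is the import name used elsewhere in these docs.
    print("string2string installed:", is_installed("string2string"))
```

If this prints `False` inside a notebook, the install likely went to a different interpreter than the one the kernel is using.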

--------------------------------

### Install string2string and scikit-learn

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Installs the string2string library and scikit-learn using pip. This is a prerequisite for running the tutorial's code.

```python
%%capture
!pip install string2string
!pip install scikit-learn
```

--------------------------------

### Install string2string Package

Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb

Installs the string2string package using pip. The command is commented out in the notebook; uncomment it if the package is not already installed in your environment.

```python
# !pip install string2string
```

--------------------------------

### Logging Setup

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Configures the logging module to record training progress, errors, and other relevant information.

```python
import logging

logging.basicConfig(level=logging.INFO)
```
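
A slightly fuller sketch of the same idea, adding a message format and a named logger. The helper `get_logger` and the logger name `string2string_demo` are illustrative assumptions, not names taken from the source notebook.

```python
import logging


def get_logger(name: str = "string2string_demo", level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a single stream handler and a timestamped format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers when called repeatedly
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_logger()
logger.info("starting run")           # emitted: INFO meets the configured level
logger.debug("detailed state dump")   # suppressed: DEBUG is below INFO
```

Guarding on `logger.handlers` matters in notebooks, where re-running a cell would otherwise duplicate every log line.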

--------------------------------

### Install string2string and Dependencies

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Installs the string2string library along with scikit-learn and networkx using pip. The '%%capture' magic command suppresses output.

```python
%%capture
!pip install string2string
!pip install scikit-learn
!pip install networkx
```

--------------------------------

### Data Preprocessing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates a basic data preprocessing step applied to raw text. `raw_data` is assumed to be defined earlier.

```python
from string2string.utils import preprocess_text

processed_data = preprocess_text(raw_data)
```

--------------------------------

### Install pytest

Source: https://github.com/stanfordnlp/string2string/blob/main/tests/README.md

Use this command to install or upgrade pytest to the latest version.

```bash
pip install -U pytest
```

--------------------------------

### Load Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates how to load data using the String2String library. Ensure the necessary libraries are installed.

```python
from string2string.utils.data import load_data

data = load_data("data.csv")
```

--------------------------------

### Example pytest execution output

Source: https://github.com/stanfordnlp/string2string/blob/main/tests/README.md

This is an example of the output you should expect when running the pytest command in the project's test directory. It shows the test session starting, collected items, progress, and a summary of passed tests.

```text
>>> pytest
============================================================================= test session starts =============================================================================
platform darwin -- Python 3.9.12, pytest-7.2.2, pluggy-1.0.0
rootdir: /Users/machine/string2string
collected 15 items                                                                                                                                                            

test_alignment.py .......                                                                                                                                               [ 46%]
test_distance.py .....                                                                                                                                                  [ 80%]
test_rogue.py .                                                                                                                                                         [ 86%]
test_sacrebleu.py .                                                                                                                                                     [ 93%]
test_search.py .                                                                                                                                                        [100%]

============================================================================= 15 passed in 6.05s ==============================================================================
```

--------------------------------

### String to String Model Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates a typical inference process using a pre-trained string-to-string model. It takes an input string, tokenizes it, generates an output, and decodes the output back into a string.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "translate English to French: Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Load Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates how to load data using the String2String library. Ensure the necessary libraries are installed before running.

```python
from string2string.dataset import Dataset

dataset = Dataset.load("data/train.tsv")
```

--------------------------------

### String2String Configuration Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Illustrates how to configure the String2String model with various parameters. Adjust these settings based on your specific task and hardware.

```python
from string2string.model import Model

config = Model.get_config(
    model_name="model",
    model_dir="model_dir",
    batch_size=128,
    epochs=10,
    learning_rate=0.0001,
    max_seq_length=128,
    warmup_steps=1000,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    logging_steps=100,
    save_steps=1000,
    eval_steps=1000,
    no_cuda=False,
    seed=42,
    fp16=False,
    fp16_opt_level="O1",
    local_rank=-1,
    server_ip='',
    server_port=''
)
```

--------------------------------

### Initialize Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Initializes a model with specified parameters. This is a common setup for many NLP tasks.

```python
from string2string.model import Model

model = Model(d_model=512, num_layers=6, num_heads=8, dff=2048, vocab_size=30000, dropout=0.1)
```

--------------------------------

### Batch Processing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Illustrates how to process data in batches, a common practice for efficient training of deep learning models.

```python
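# `dataset` is assumed to be an iterable of (input, target) batches defined earlier
for batch, (inp, tar) in enumerate(dataset):
    # Process each batch
    pass  # placeholder so the loop body is valid Python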
```
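
To make the batching idea concrete without a deep learning framework, here is a minimal standard-library sketch. The helper `iter_batches` and the toy pairs are assumptions for illustration; a real pipeline would normally rely on the batching utilities of its framework.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def iter_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive slices of `pairs`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(pairs), batch_size):
        yield list(pairs[start:start + batch_size])


# Toy (input, target) pairs standing in for a real dataset.
pairs = [("in1", "out1"), ("in2", "out2"), ("in3", "out3"), ("in4", "out4"), ("in5", "out5")]

for batch_index, batch in enumerate(iter_batches(pairs, batch_size=2)):
    inputs = [inp for inp, _ in batch]
    targets = [tar for _, tar in batch]
    print(batch_index, inputs, targets)
```

The last batch is smaller than `batch_size` when the data does not divide evenly, which downstream code needs to tolerate.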

--------------------------------

### Load and Process Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates loading a dataset and performing basic preprocessing steps. Ensure the 'datasets' library is installed.

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="/stanfordnlp/string2string/blob/main/data/multi_news.jsonl")

# Assumes `tokenizer` has already been created (e.g. with AutoTokenizer.from_pretrained)
def preprocess_function(examples):
    inputs = [ex["article"] for ex in examples["input_data"]]
    targets = [ex["summary"] for ex in examples["target_data"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
```

--------------------------------

### Another String to String Model Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet runs inference with the pre-trained `t5-small` model on a summarization prompt.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "summarize: The Orbiter Discovery is scheduled to launch on Tuesday, August 9, 2005, at 4:00 PM EDT from Kennedy Space Center in Florida."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Checkpoint Manager Setup

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Sets up a checkpoint manager to save and restore model weights during training. This is crucial for resuming training and preventing loss of progress.

```python
import tensorflow as tf

# `model` and `optimizer` are assumed to be defined earlier
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
```

--------------------------------

### Train the Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Starts the model training process using the configured Trainer. This may take a significant amount of time.

```python
trainer.train()
```

--------------------------------

### Importing Libraries

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Imports necessary libraries for text processing and comparison. Ensure these are installed before running.

```python
import pandas as pd
import numpy as np
import difflib
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```

--------------------------------

### Basic String2String Usage

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates the fundamental usage of the String2String library for basic string transformations. Ensure the library is installed before running.

```python
from string2string import String2String

# Initialize the String2String object
s2s = String2String()

# Example usage
input_string = "Hello, world!"
output_string = s2s.transform(input_string)
print(f"Input: {input_string}")
print(f"Output: {output_string}")
```

--------------------------------

### Load and Display Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Loads a dataset and displays its first few rows. Ensure the 'pandas' library is installed and imported.

```python
import pandas as pd

df = pd.read_csv('/kaggle/input/plagiarism-detection-dataset/train.csv')
df.head()
```

--------------------------------

### Tokenization Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Shows how to tokenize text using the project's tokenizer. Tokenization is a crucial step in preparing text for machine learning models.

```python
from string2string.tokenizer import String2StringTokenizer

tokenizer = String2StringTokenizer.load("path/to/tokenizer")
text = "Tokenize this sentence."
tokens = tokenizer.tokenize(text)
print(tokens)
```
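
For readers who want to see what tokenization and its inverse do without loading a saved tokenizer, here is a minimal regex-based sketch. It is not the project's tokenizer; `simple_tokenize` and `simple_detokenize` are hypothetical helpers built on the standard library only.

```python
import re
from typing import List

# One token per run of word characters, or per single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def simple_detokenize(tokens: List[str]) -> str:
    """Join tokens with spaces, attaching each punctuation mark to the preceding token.

    This is a simplification: opening brackets and quotes are not handled specially.
    """
    text = " ".join(tokens)
    return re.sub(r"\s+([^\w\s])", r"\1", text)


tokens = simple_tokenize("Tokenize this sentence.")
print(tokens)                     # ['Tokenize', 'this', 'sentence', '.']
print(simple_detokenize(tokens))  # Tokenize this sentence.
```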

--------------------------------

### Plagiarism Detection Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates a basic plagiarism detection check by comparing similarity scores. This is a simplified example.

```python
from difflib import SequenceMatcher

def plagiarism_check(text1, text2):
    ratio = SequenceMatcher(None, text1, text2).ratio()
    return ratio

text_original = "This is the original document content."
text_suspected = "This is the original document content, with some minor changes."

similarity_score = plagiarism_check(text_original, text_suspected)
print(f"Similarity Score: {similarity_score}")

if similarity_score > 0.8:
    print("Potential plagiarism detected.")
else:
    print("No significant plagiarism detected.")
```
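
The same character-level comparison can be extended to a small collection by scoring every pair of documents and ranking the results. This is a sketch under the assumption that the collection is small enough for an all-pairs comparison; `rank_pairs` and the sample documents are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def rank_pairs(documents: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Score every unordered pair of documents and sort by similarity, highest first."""
    scored = [
        (name_a, name_b, SequenceMatcher(None, documents[name_a], documents[name_b]).ratio())
        for name_a, name_b in combinations(documents, 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


documents = {
    "doc1": "This is the first document.",
    "doc2": "This is the first document, with edits.",
    "doc3": "Completely unrelated text about cooking.",
}

for name_a, name_b, score in rank_pairs(documents):
    print(f"{name_a} vs {name_b}: {score:.2f}")
```

The number of comparisons grows quadratically with the number of documents, so this approach suits small collections only.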

--------------------------------

### String2String Model for Translation

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates translation with a sequence-to-sequence model (`google/flan-t5-small`) loaded through Hugging Face Transformers. The input prompt names the source and target languages.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to German: The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Initialize FaissSearch with OPT-125M Model

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Downloads and initializes the FaissSearch tool with the specified Hugging Face model. Ensure the transformers library is installed.

```python
from string2string.search import FaissSearch

# Let's download OPT-125M from Facebook using HuggingFace's transformers library
model_name = 'facebook/opt-125m'
faiss_search = FaissSearch(model_name_or_path = model_name)
```

--------------------------------

### Detokenization Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates detokenizing a list of tokens back into a human-readable string. This is the inverse operation of tokenization.

```python
from string2string.tokenizer import String2StringTokenizer

tokenizer = String2StringTokenizer.load("path/to/tokenizer")
tokens = ["token1", "token2", "token3"]
text = tokenizer.detokenize(tokens)
print(text)
```

--------------------------------

### Load and Use a Pre-trained Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained string-to-string model and use it for inference. Ensure you have the necessary libraries installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to French: How are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Load and Prepare Patent Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

This snippet loads patent data and prepares it for further processing. Ensure the 'pandas' library is installed.

```python
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/data/USPTO/patent_data.csv")
df.head()
```

--------------------------------

### Fine-tune a Model on a Custom Dataset

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates fine-tuning a pre-trained model on a custom dataset. It involves preparing the dataset and using the Trainer API. Make sure your dataset is in the correct format.

```python
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess_function(examples):
    inputs = [ex for ex in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

args = Seq2SeqTrainingArguments(
    "./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
```

--------------------------------

### Load and Use a Pre-trained String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained model and use it for inference. Ensure you have the necessary libraries installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Initialize Plagiarism Detector

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Initializes the plagiarism detector with a specific algorithm. This setup is required before processing any essays.

```python
from plagiarism_detector import PlagiarismDetector

plagiarism_detector = PlagiarismDetector('lcs')
```

--------------------------------

### Load and Prepare Patent Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified source and prepares it for further processing. Ensure the 'pandas' library is installed.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("/kaggle/input/uspto-patent-abstracts/USPTO_patent_abstracts.csv")

# Display the first 5 rows
df.head()
```

--------------------------------

### Initialize Plagiarism Detector

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Initializes a plagiarism detection model using the 'plagiarism_detector' library. This setup is necessary before performing any detection tasks.

```python
from plagiarism_detector import PlagiarismDetector

plagiarism_detector = PlagiarismDetector()
plagiarism_detector.load_model()
```

--------------------------------

### String2String Model with Batch Inference

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example illustrates how to perform inference on a batch of inputs using the String2String model. This is more efficient for processing multiple sequences.

```python
from string2string import String2String

model = String2String.from_pretrained("path/to/your/model")
inputs = ["input text 1", "input text 2"]
results = model.predict_batch(inputs)
print(results)
```

--------------------------------

### String2String Model for Question Answering

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example shows question answering with a sequence-to-sequence model (`google/flan-t5-small`) loaded through Hugging Face Transformers. The prompt combines the question and the context.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "What is the capital of France?"
context = "France is a country in Europe. Its capital is Paris."
inputs = tokenizer(f"question: {question} context: {context}", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Performs inference using the trained model to generate output for a given input. This is typically done after training is complete.

```python
result, _ = model.translate(sentence)
```

--------------------------------

### Load and Use a Pre-trained Encoder-Decoder Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Loads a generic encoder-decoder model and tokenizer. This is a foundational example for many sequence-to-sequence tasks.

```python
from transformers import EncoderDecoderModel, AutoTokenizer

model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

--------------------------------

### Python String Parsing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Skeleton of a Python script entry point for a string-parsing example. The parsing logic itself is not included in the source snippet.

```python
import sys

def main():
    # Placeholder: the string-parsing logic is not included in the source snippet.
    pass

if __name__ == "__main__":
    main()
```
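
Since the snippet leaves the parsing step empty, here is one hedged illustration of extracting fields from a structured string. The `key = value; ...` format, the helper `parse_key_values`, and the sample record are all assumptions made for this sketch, not something taken from the source notebook.

```python
from typing import Dict


def parse_key_values(line: str, pair_sep: str = ";", kv_sep: str = "=") -> Dict[str, str]:
    """Parse 'key=value;key=value' text into a dictionary of stripped strings."""
    result: Dict[str, str] = {}
    for chunk in line.split(pair_sep):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty segments such as a trailing separator
        key, sep, value = chunk.partition(kv_sep)
        if not sep:
            raise ValueError(f"missing {kv_sep!r} in segment {chunk!r}")
        result[key.strip()] = value.strip()
    return result


record = parse_key_values("title = Widget; ipc = G06F; year = 2016;")
print(record)  # {'title': 'Widget', 'ipc': 'G06F', 'year': '2016'}
```

Using `str.partition` rather than `str.split` keeps any later separator characters inside the value intact.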

--------------------------------

### String2String Utility Function Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Illustrates the usage of a utility function within the String2String library. This function performs a specific text transformation.

```python
from string2string.utils.text import transform_text

processed_text = transform_text("sample text")
```

--------------------------------

### String2String Utility: Text Style Transfer (Few-Shot Prompting)

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Applies few-shot prompting for style transfer, providing examples within the prompt to guide the model.

```python
from string2string.utils.style_transfer import few_shot_prompt_style_transfer

styled_text = few_shot_prompt_style_transfer(llm_model, "prompt with examples", "text")
```

--------------------------------

### Load and Use a Pre-trained String-to-String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained model and tokenizer from Hugging Face Transformers and use it for inference. Ensure you have the 'transformers' library installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### String2String with Custom Configuration

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to initialize String2String with custom configurations for specific transformation needs. This allows for fine-tuning the library's behavior.

```python
from string2string import String2String

# Define custom configuration parameters
config = {
    "model_name": "my_custom_model",
    "max_length": 100,
    "temperature": 0.7
}

# Initialize String2String with custom configuration
s2s_custom = String2String(config=config)

# Example usage with custom configuration
input_string = "Another example string."
output_string = s2s_custom.transform(input_string)
print(f"Input: {input_string}")
print(f"Output: {output_string}")
```

--------------------------------

### compute_multi_ref_score

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/metrics.md

Scores a batch of examples with multiple references.

```APIDOC
## compute_multi_ref_score

### Description
Scores a batch of examples with multiple references.

### Parameters
* **source_sentences** (List[str]) - The source sentences.
* **target_sentences** (List[List[str]]) - The target sentences.
* **agg** (str) - The aggregation method. Can be “mean” or “max”.
* **batch_size** (int) - The batch size.

### Returns
The BARTScore for each example.

### Return type
Dict[str, List[float]]

### Raises
**ValueError** - If the number of source sentences and target sentences do not match.
```

--------------------------------

### BARTScore.compute_multi_ref_score

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/similarity.md

Scores a batch of examples with multiple references using BARTScore.

```APIDOC
## BARTScore.compute_multi_ref_score

### Description
Score a batch of examples with multiple references.

### Parameters
* **source_sentences** (List[str]) - The source sentences.
* **target_sentences** (List[List[str]]) - The target sentences.
* **agg** (str) - The aggregation method. Can be “mean” or “max”.
* **batch_size** (int) - The batch size.

### Returns
The BARTScore for each example.

### Return type
Dict[str, List[float]]

### Raises
**ValueError** - If the number of source sentences and target sentences do not match.
```
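
The `agg` parameter and the `ValueError` condition described above can be illustrated without loading a BART model. The sketch below is an assumption about the general shape of multi-reference aggregation, not the library's implementation: `toy_score` is a stand-in character-level scorer, `multi_ref_scores` is a hypothetical helper, and it returns a plain list rather than the `Dict[str, List[float]]` documented for the real method.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def toy_score(source: str, reference: str) -> float:
    """Stand-in scorer (character-level similarity); this is NOT BARTScore."""
    return SequenceMatcher(None, source, reference).ratio()


def multi_ref_scores(
    sources: List[str],
    references: List[List[str]],
    agg: str = "mean",
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[float]:
    """Score each source against all of its references and aggregate per example."""
    if len(sources) != len(references):
        raise ValueError("number of sources and reference lists must match")
    if agg not in ("mean", "max"):
        raise ValueError("agg must be 'mean' or 'max'")
    results = []
    for source, refs in zip(sources, references):
        if not refs:
            raise ValueError("each example needs at least one reference")
        scores = [score_fn(source, ref) for ref in refs]
        results.append(max(scores) if agg == "max" else sum(scores) / len(scores))
    return results


sources = ["the cat sat"]
references = [["the cat sat", "a dog ran"]]
print(multi_ref_scores(sources, references, agg="max"))
print(multi_ref_scores(sources, references, agg="mean"))
```

With `agg="max"` an example is credited for its best-matching reference, while `agg="mean"` rewards agreement with all references.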

--------------------------------

### Define Training Arguments

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Sets up training arguments for the model. This includes output directory, learning rate, and number of epochs.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/string2string/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)
```

--------------------------------

### String2String Model with Custom Configuration

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates loading a String2String model with custom configuration parameters. This allows for fine-tuning the model's behavior.

```python
from string2string import String2String
from string2string.config import String2StringConfig

config = String2StringConfig.from_pretrained("path/to/your/config")
model = String2String.from_pretrained("path/to/your/model", config=config)
result = model.predict("input text")
print(result)
```

--------------------------------

### Initialize String2String Model and Tokenizer

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to load a pre-trained model and its corresponding tokenizer for sequence-to-sequence tasks. This is a prerequisite for training or inference.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

--------------------------------

### String2String Utility: Text Adversarial Attack

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Generates adversarial examples for text to test model robustness.

```python
from string2string.utils.adversarial import generate_adversarial_text

adv_text = generate_adversarial_text("This is a normal sentence.")
```

--------------------------------

### Initialize and Use String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates how to initialize and use a String2String model for text processing. This is a foundational step for many tasks within the project.

```python
from string2string.model import String2String

model = String2String("path/to/your/model")
result = model.predict("This is a sample text.")
print(result)
```

--------------------------------

### Get Plagiarism Results

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Retrieves the plagiarism detection results. This function returns a list of detected plagiarism instances.

```python
results = plagiarism_detector.get_results()
results
```

--------------------------------

### Prepare Data for Plotly

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

This snippet calls the prepare_plotly_data function to get the necessary data structures for Plotly visualization.

```python
# Let's prepare the data for plotly
tsne_coords, tsne_labels, tsne_titles, tsne_hover_texts = prepare_plotly_data(
    tsne_embeddings, patent_titles, patent_ipc_subclass_labels, most_common_labels)
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data and prepares it for semantic search. Ensure the 'data' directory exists and contains the necessary patent files.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("/data/patent.csv")

# Display the first 5 rows
df.head()
```

--------------------------------

### Initialize and Run Plagiarism Detection

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

This snippet shows how to initialize and run a plagiarism detection process. It involves setting up parameters and executing the detection.

```python
from string2string.plagiarism_detection import PlagiarismDetector

# Initialize the detector
detector = PlagiarismDetector(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,
    batch_size=16,
    device="cpu",
)

# Run plagiarism detection
# Replace with your actual text data
text_data = {
    "doc1": "This is the first document.",
    "doc2": "This is the second document, which is similar to the first.",
    "doc3": "This is a completely different document.",
}
data = detector.run_plagiarism_detection(text_data)
print(data)
```

--------------------------------

### Plagiarism Detection with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Example of using the String2String model for plagiarism detection. This snippet assumes a model fine-tuned for this task.

```python
from string2string.model import String2String

plagiarism_model = String2String.load("path/to/plagiarism/model")
text1 = "Original text content."
text2 = "Slightly modified version of the original text."

# Assuming the model outputs a score or a classification
result = plagiarism_model.predict(text1, text2)
print(f"Plagiarism score: {result}")
```

--------------------------------

### Generate Text with T5 Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Generates text using a T5 model. This example shows a simple translation task.

```python
# Assumes `tokenizer` and `model` (a T5 checkpoint) were loaded in an earlier cell
input_text = "translate English to German: That is good."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Initialize FaissSearch and Corpus

Source: https://github.com/stanfordnlp/string2string/blob/main/README.md

Instantiate the FaissSearch class with a specified Hugging Face model and initialize the corpus for semantic search.

```python
>>> # Let's create a FaissSearch class instance from the search module to perform semantic search
>>> from string2string.search import FaissSearch
>>> faiss_search = FaissSearch(model_name_or_path = 'facebook/bart-large')

>>> # Let's create a corpus of strings (e.g., sentences)
>>> corpus = {
        'text': [
            "Coffee is my go-to drink in the morning.", 
            "I always try to make time for exercise.", 
            "Learning something new every day keeps me motivated.", 
            "The sunsets in my hometown are breathtaking.", 
            "I am grateful for the support of my friends and family.", 
            "The book I'm reading is incredibly captivating.", 
            "I love listening to music while I work.", 
            "I'm excited to try the new restaurant in town.", 
            "Taking a walk in nature always clears my mind.", 
            "I believe that kindness is the most important trait.", 
            "It's important to take breaks throughout the day.", 
            "I'm looking forward to the weekend.", 
            "Reading before bed helps me relax.", 
            "I try to stay positive even in difficult situations.", 
            "Cooking is one of my favorite hobbies.", 
            "I'm grateful for the opportunity to learn and grow every day.", 
            "I love traveling and experiencing new cultures.", 
            "I'm proud of the progress I've made so far.", 
            "A good night's sleep is essential for my well-being.", 
            "Spending time with loved ones always brings me joy.", 
            "I'm grateful for the beauty of nature around me.", 
            "I try to live in the present moment and appreciate what I have.", 
            "I believe that honesty is always the best policy.", 
            "I enjoy challenging myself and pushing my limits.", 
            "I'm excited to see what the future holds."
        ],
    }

>>> # Next we need to initialize and encode the corpus
>>> faiss_search.initialize_corpus(
    corpus=corpus,
    section='text', 
    embedding_type='mean_pooling',
    )
```
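The `embedding_type='mean_pooling'` argument above suggests that a sentence vector is formed by averaging token-level vectors. The following is a minimal sketch of that idea in plain Python, assuming the common convention of skipping padded positions via an attention mask. The function name `mean_pool` and the toy vectors are hypothetical, and the library's own pooling may differ in its details.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero (hypothetical helper)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:  # skip padding positions
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [t / count for t in total]


# Two real tokens followed by one padding token
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
print(sentence_vector)
```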

--------------------------------

### Python String Manipulation Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Provides an entry-point scaffold for a string manipulation script. The manipulation logic itself is not included in this snippet.

```python
def main():
    # Placeholder: the string manipulation logic is not part of this snippet.
    pass


if __name__ == "__main__":
    main()
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified path and prepares it for further processing. Ensure the 'data' directory exists and contains the necessary files.

```python
import json
import os

import pandas as pd

data_path = "/stanfordnlp/string2string"

def load_data(data_path):
    """Load data from the specified path."""
    data = []
    for filename in os.listdir(data_path):
        if filename.endswith(".jsonl"):
            file_path = os.path.join(data_path, filename)
            with open(file_path, "r") as f:
                for line in f:
                    # Parse each line as JSON; avoid eval(), which executes arbitrary code
                    data.append(json.loads(line))
    return pd.DataFrame(data)

df = load_data(data_path)
df.head()
```

--------------------------------

### Initialize GloVe Embeddings

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md

Instantiate the GloVeEmbeddings class with specified model, dimension, and download options. The model will be downloaded automatically if not found.

```python
from string2string.misc.word_embeddings import GloVeEmbeddings

# Initialize with default parameters
glove_embeddings = GloVeEmbeddings()

# Initialize with a specific model and dimension
glove_embeddings_custom = GloVeEmbeddings(model='glove.twitter.27B', dim=100)

# Force a fresh download even if the model already exists locally
glove_embeddings_force_download = GloVeEmbeddings(force_download=True)
```

--------------------------------

### String2String Model with Custom Generation Parameters

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates how to control the generation process by setting parameters like `max_length` and `num_beams`. This allows for more tailored output.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Summarize: The quick brown fox jumps over the lazy dog.", return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_length=50, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified path and prepares it for further processing. Ensure the 'data_path' variable points to your dataset.

```python
import pandas as pd

data_path = "/content/drive/MyDrive/data/patents.csv"
df = pd.read_csv(data_path)
df.head()
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified path and prepares it for further processing. Ensure the 'data_path' variable points to your dataset.

```python
import pandas as pd

data_path = "/content/drive/MyDrive/data/USPTO_patent_abstracts.csv"
df = pd.read_csv(data_path)
df.head()
```

--------------------------------

### Summarize Text with Pegasus Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Generates a summary for a given text using a Pegasus model. This example focuses on abstractive summarization.

```python
ARTICLE = """
Your text to summarize goes here. Pegasus is designed for abstractive summarization,
meaning it can generate summaries that are not just extracts of the original text.
"""
# Assumes `tokenizer` and `model` (a Pegasus checkpoint) were loaded in an earlier cell
inputs = tokenizer(ARTICLE, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(inputs.input_ids, num_beams=1, max_length=100)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

--------------------------------

### Configuration Loading

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Loads configuration parameters from a JSON file. This is a common way to manage hyperparameters and settings.

```python
import json

with open('config.json', 'r') as f:
    config = json.load(f)
```

--------------------------------

### Python String Utility Function Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Provides an entry-point scaffold for a string utility script. The utility functions themselves are not included in this snippet.

```python
def main():
    # Placeholder: the string utility functions are not part of this snippet.
    pass


if __name__ == "__main__":
    main()
```

--------------------------------

### Initialize Rabin-Karp Search

Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb

Import and initialize the RabinKarpSearch class, passing the configured hash function. This sets up the search object for pattern matching.

```python
from string2string.search import RabinKarpSearch

# `rolling_hash` is the hash function object configured in an earlier step
rabin_karp = RabinKarpSearch(
    hash_function=rolling_hash,
)
```
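To show why Rabin-Karp needs a hash function at all, here is a minimal pure-Python sketch of the algorithm with a polynomial rolling hash. It is not the library's implementation: the function name `rabin_karp_search` and the `base` and `modulus` defaults are assumptions chosen for illustration.

```python
def rabin_karp_search(pattern: str, text: str, base: int = 256, modulus: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Weight of the leading character of a window, used when sliding
    high = pow(base, m - 1, modulus)

    # Hash the pattern and the first window of the text
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus

    for start in range(n - m + 1):
        # Compare characters only when the hashes agree (guards against collisions)
        if p_hash == w_hash and text[start:start + m] == pattern:
            return start
        if start < n - m:
            # Roll the hash: drop text[start], append text[start + m]
            w_hash = ((w_hash - ord(text[start]) * high) * base + ord(text[start + m])) % modulus
    return -1
```

The rolling update is what keeps the expected running time linear: each window hash is derived from the previous one in constant time instead of being recomputed from scratch.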

--------------------------------

### BoyerMooreSearch.__init__()

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md

Initializes the Boyer-Moore search algorithm class. The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms.

```APIDOC
## BoyerMooreSearch.__init__()

### Description
Initializes the Boyer-Moore search algorithm class.

### Method
__init__

### Parameters
None

### Returns
None
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified path and prepares it for further processing. Ensure the 'data_path' variable points to the correct location of your patent data.

```python
import os
import pandas as pd

data_path = "/content/drive/MyDrive/data/USPTO/patent_data.csv"

def load_data(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Data not found at {path}")
    df = pd.read_csv(path)
    return df

df = load_data(data_path)
df.head()
```

--------------------------------

### String2String Utility: Get Model Configuration

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Retrieves the configuration details of the String2String model. Useful for understanding model architecture and hyperparameters.

```python
from string2string.utils.config import get_config

config = get_config()
print(config)
```

--------------------------------

### GloVeEmbeddings

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md

This class implements the GloVe word embeddings. It can be initialized with various models and dimensions, and can be used to get embeddings for given tokens.

```APIDOC
## class string2string.misc.word_embeddings.GloVeEmbeddings

### Description
This class implements the GloVe word embeddings.

### Parameters
* **model** (str) - Optional - The model to use. Default is ‘glove.6B.200D’. (Options are: ‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’.)
* **dim** (int) - Optional - The dimension of the embeddings. Default is 300.
* **force_download** (bool) - Optional - Whether to force download the model. Default is False.
* **dir** (str) - Optional - The directory to save or load the model. Default is None.
* **tokenizer** (Tokenizer) - Optional - The tokenizer to use. Default is None.

### Methods
#### __call__(tokens: List[str] | str) -> Tensor
This function returns the embeddings of the given tokens.

* **Parameters:**
  **tokens** (Union[List[str], str]) – The tokens to embed.
* **Returns:**
  The embeddings of the given tokens.
* **Return type:**
  Tensor

#### get_embedding(tokens: List[str] | str) -> Tensor
This function returns the embeddings of the given tokens.

* **Parameters:**
  **tokens** (Union[List[str], str]) – The tokens to embed.
* **Returns:**
  The embeddings of the given tokens.
* **Return type:**
  Tensor

### Raises
**ValueError** – If the model is not in the MODEL_OPTIONS [‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’].

### Notes
* If directory is None, the model will be saved in the torch hub directory.
* If the model is not downloaded, it will be downloaded automatically.
```

--------------------------------

### Load and Use a Pre-trained Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates loading a pre-trained model for sentence embeddings. This is a foundational step for many NLP tasks.

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Example usage: encode sentences
sentences = [
    'This is the first sentence.',
    'This is the second sentence.'
]
embeddings = model.encode(sentences)
print(embeddings)
```

--------------------------------

### Initialize FaissSearch

Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb

Import and initialize the FaissSearch class with a specified model and tokenizer. This sets up the core object for semantic search operations.

```python
# Import the FaissSearch class from the search module
from string2string.search import FaissSearch

# Initialize the FaissSearch class
faiss_search = FaissSearch(
    model_name_or_path = 'facebook/bart-large',
    tokenizer_name_or_path = 'facebook/bart-large',
)
```

--------------------------------

### Load and Process Essay Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

This snippet shows how to load and preprocess essay data for plagiarism detection. Ensure the 'pandas' library is installed.

```python
import pandas as pd

essays = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Plagiarism Detection of Essays/essays.csv")
essays.head()
```

--------------------------------

### Train String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Illustrates the process of training a String2String model. This typically involves providing training data and configuration.

```python
from string2string.model import String2String

model = String2String()
model.train("path/to/training/data", "path/to/save/model")
print("Model trained and saved.")
```

--------------------------------

### Initialize Similarity Metrics

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Initializes instances of LongestCommonSubsequence and LongestCommonSubstring for similarity calculations.

```python
from string2string.alignment import LongestCommonSubsequence, LongestCommonSubstring

# Initialize the similarity classes
lcsubseq = LongestCommonSubsequence()
lcsubstr = LongestCommonSubstring()
```
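The two metrics differ in one respect: a common subsequence may skip characters, while a common substring must be contiguous. The sketch below computes both lengths with standard dynamic programming in plain Python. The function names are hypothetical and this is not the library's implementation.

```python
from typing import Sequence


def lcs_subsequence_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common substring (contiguous run)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # A mismatch resets the run length to zero
            curr.append(prev[j - 1] + 1 if x == y else 0)
            best = max(best, curr[j])
        prev = curr
    return best


print(lcs_subsequence_length("ABCBDAB", "BDCABA"))
print(lcs_substring_length("ABABC", "BABCA"))
```

Both functions accept lists of tokens as well as strings, which matters for essay comparison, where word-level matching is usually more meaningful than character-level matching.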

--------------------------------

### Load and Display Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Loads a dataset and displays the first few rows. Ensure the 'data.csv' file is in the current working directory.

```python
import pandas as pd

df = pd.read_csv('data.csv')
df.head()
```

--------------------------------

### BoyerMooreSearch.search()

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md

Searches for a pattern within a given text using the Boyer-Moore algorithm. Returns the starting index of the first occurrence or -1 if not found.

```APIDOC
## BoyerMooreSearch.search(pattern: str, text: str)

### Description
This function searches for the pattern in the text using the Boyer-Moore algorithm.

### Method
search

### Parameters
- **pattern** (str) - Required - The pattern to search for.
- **text** (str) - Required - The text to search in.

### Returns
The index of the pattern in the text (or -1 if the pattern is not found).

### Return type
int

### Raises
**AssertionError** – If the text or the pattern is not a string.
```
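The skipping heuristic can be illustrated with the Boyer-Moore-Horspool variant, a simplification that uses only a bad-character shift table. This is a sketch under that assumption and not the library's implementation. The function name `horspool_search` is hypothetical.

```python
def horspool_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # For every character except the last one in the pattern, record how far
    # its rightmost occurrence is from the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        # Compare the whole window (done via slicing here for brevity)
        if text[pos:pos + m] == pattern:
            return pos
        # Shift based on the text character aligned with the pattern's last position;
        # characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

Like the documented `search` method, the sketch returns `-1` when the pattern does not occur in the text.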

--------------------------------

### String2String Model for Text Generation

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet illustrates using a String2String model for general text generation. You can provide a prompt to guide the output.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Write a short story about a robot: Once upon a time,", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### FaissSearch Initialization

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md

Initializes the wrapper for the FAISS library, used for semantic search. It allows specifying the model, tokenizer, and device.

```APIDOC
## FaissSearch

### Description
Initializes the wrapper for the FAISS library, which is used to perform semantic search.

### Method
__init__

### Parameters
* **model_name_or_path** (str, optional) – The name or path of the model to use. Defaults to ‘facebook/bart-large’.
* **tokenizer_name_or_path** (str, optional) – The name or path of the tokenizer to use. Defaults to ‘facebook/bart-large’.
* **device** (str, optional) – The device to use. Defaults to ‘cpu’.

### Returns
None
```
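Conceptually, semantic search finds the corpus vectors closest to a query vector. The sketch below shows an exact brute-force version of that lookup in plain Python using Euclidean distance. It is only an illustration of the idea: the function name `top_k_nearest` and the toy vectors are hypothetical, and FAISS itself uses optimized index structures instead of a Python loop.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def top_k_nearest(query: Sequence[float], corpus_vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[float, int]]:
    """Return the k (distance, index) pairs closest to the query, nearest first."""
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(corpus_vectors)]
    return heapq.nsmallest(k, scored)


corpus_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
for distance, index in top_k_nearest([0.9, 0.9], corpus_vectors, k=2):
    print(index, round(distance, 3))
```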

--------------------------------

### Load Data for String to String Models

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Authenticates the current Google Colab user and downloads a file from a Google Cloud Storage bucket. Ensure the project and bucket names point to resources you can access.

```python
from google.colab import auth
auth.authenticate_user()
from google.cloud import storage

client = storage.Client(project='string2string')
bucket = client.get_bucket('string2string')
blob = bucket.blob('plagiarism_detection.ipynb')
blob.download_to_filename('plagiarism_detection.ipynb')

```

--------------------------------

### Custom Layer Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Shows the definition of a custom layer within a TensorFlow/Keras model. This layer might implement specific attention mechanisms or transformations.

```python
import tensorflow as tf


class CustomLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Define layer weights here
        super(CustomLayer, self).build(input_shape)

    def call(self, inputs):
        # Define forward pass logic here
        return inputs
```

--------------------------------

### Perform Inference with String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to use a pre-trained String2String model for inference. This requires a model file and input data.

```python
from string2string.model import String2String

model = String2String("model.pth")
result = model.predict("input text")
```

--------------------------------

### Generate Text with Encoder-Decoder Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Generates text using a configured encoder-decoder model. This setup allows combining different encoder and decoder architectures.

```python
from transformers import AutoTokenizer

# Assuming 'model' and 'tokenizer' are loaded as in the previous snippet
input_text = "This is an example input."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Set the decoder start token ID on the model config if needed (e.g., for GPT-2 decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### String2String Model with Custom Generation Parameters

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to customize generation parameters for the String2String model, such as `max_length`, `num_beams`, and `temperature`. These parameters control the output sequence generation.

```python
from string2string import String2String

model = String2String.from_pretrained("path/to/your/model")
result = model.predict("input text", max_length=50, num_beams=4, temperature=0.7)
print(result)
```

--------------------------------

### KMP Search Algorithm Implementation

Source: https://github.com/stanfordnlp/string2string/blob/main/README.md

Demonstrates how to use the KMPSearch class to find the index of a pattern within a text using the KMP algorithm. Ensure the KMPSearch class is imported from string2string.search.

```python
>>> # Let's create a KMPSearch class instance from the search module
>>> from string2string.search import KMPSearch
>>> knuth_morris_pratt = KMPSearch()

>>> # Let's define a pattern and a text
>>> pattern = 'Jane Austen'
>>> text = 'Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey, Persuasion, and Lady Susan were written by Jane Austen and are important works of English literature.'

>>> # Now let's find the index of the pattern in the text, if it exists (otherwise, -1 is returned).
>>> idx = knuth_morris_pratt.search(pattern=pattern,text=text)

>>> print(f'The index of the pattern in the text is {idx}.')
# The index of the pattern in the text is 127.
```
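For readers curious about what a KMP search does internally, here is a minimal pure-Python sketch. It is not the library's implementation, and the names `build_failure_table` and `kmp_search` are hypothetical. The usage at the end repeats the Jane Austen example with this sketch.

```python
from typing import List


def build_failure_table(pattern: str) -> List[int]:
    """failure[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern instead of rewinding the text
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


text = ('Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, '
        'Northanger Abbey, Persuasion, and Lady Susan were written by '
        'Jane Austen and are important works of English literature.')
print(kmp_search('Jane Austen', text))  # prints 127
```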

--------------------------------

### get_alignment

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/alignment.md

Gets the alignment of two strings (or list of strings) using the Hirschberg algorithm. This method needs only linear space while keeping O(nm) time complexity, where n and m are the lengths of the two inputs.

```APIDOC
## get_alignment

### Description
Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm.

### Parameters
* **str1** - The first string (or list of strings).
* **str2** - The second string (or list of strings).

### Returns
The aligned strings as a tuple of two strings (or list of strings).
```
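The space saving in Hirschberg's algorithm comes from a building block that computes alignment scores while keeping only one row of the dynamic programming table in memory. The sketch below shows that building block alone, not the full divide-and-conquer alignment. The function name `nw_last_row` and the `match`, `mismatch`, and `gap` scores are assumptions for illustration and may differ from the library's scoring.

```python
from typing import List, Sequence


def nw_last_row(s1: Sequence, s2: Sequence, match: int = 1, mismatch: int = -1, gap: int = -1) -> List[int]:
    """Return the last row of the global alignment score table.

    Entry j is the best score for aligning all of s1 with s2[:j].
    Only two rows are held at any time, so memory is O(len(s2)).
    """
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


# With match=0 and unit penalties, the final score is the negated edit distance
print(nw_last_row("kitten", "sitting", match=0)[-1])
```

Hirschberg's algorithm calls a routine like this on the first half of one input and on the reversed second half, then uses the two rows to pick the split point for its recursion.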

--------------------------------

### Get Embeddings for Tokens

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md

Retrieves the word embeddings for a given list of tokens or a single string of tokens. The function returns a Tensor containing the embeddings.

```APIDOC
## __call__(tokens: List[str] | str) -> Tensor

### Description
This function returns the embeddings of the given tokens.

### Parameters
* **tokens** (Union[List[str], str]) – The tokens to embed.

### Returns
The embeddings of the given tokens.

### Return type
Tensor
```

```APIDOC
## get_embedding(tokens: List[str] | str) -> Tensor

### Description
This function returns the embeddings of the given tokens.

### Parameters
* **tokens** (Union[List[str], str]) – The tokens to embed.

### Returns
The embeddings of the given tokens.

### Return type
Tensor
```

--------------------------------

### Train a Model with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows the process of training a model using the String2String library. This involves specifying model parameters and the dataset.

```python
from string2string.model import Model

model = Model.train(
    dataset=dataset,
    model_name="model",
    model_dir="model_dir",
    batch_size=128,
    epochs=10,
    learning_rate=0.0001,
    max_seq_length=128,
    warmup_steps=1000,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    logging_steps=100,
    save_steps=1000,
    eval_steps=1000,
    no_cuda=False,
    seed=42,
    fp16=False,
    fp16_opt_level="O1",
    local_rank=-1,
    server_ip='',
    server_port=''
)
```

--------------------------------

### Get Word Embeddings

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md

Retrieve the vector representations for a list of tokens or a single string of tokens. The class handles tokenization internally if a string is provided.

```python
from string2string.misc.word_embeddings import GloVeEmbeddings

glove_embeddings = GloVeEmbeddings()

# Get embeddings for a list of tokens
embeddings_list = glove_embeddings(['hello', 'world'])

# Get embeddings for a single string of tokens
embeddings_string = glove_embeddings('this is a test')

# The get_embedding method can also be used directly
embeddings_direct = glove_embeddings.get_embedding(['another', 'example'])
print(embeddings_list.shape)
print(embeddings_string.shape)
print(embeddings_direct.shape)
```

--------------------------------

### String2String Utility: Model Loading from Checkpoint

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Loads a model's state, including optimizer state and epoch number, from a saved checkpoint.

```python
from string2string.utils.checkpointing import load_checkpoint

model, optimizer, epoch = load_checkpoint("model_checkpoint.pt")
```