### Install string2string Library

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/index.md

Install the string2string library using pip. This command should be run in your terminal.

```bash
pip install string2string
```

--------------------------------

### Install string2string and scikit-learn

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Installs the string2string library and scikit-learn using pip. This is a prerequisite for running the tutorial's code.

```python
%%capture
!pip install string2string
!pip install scikit-learn
```

--------------------------------

### Install String2string Package

Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb

Installs the string2string package using pip. Run this command in your environment before using the package.

```python
# !pip install string2string
```

--------------------------------

### Logging Setup

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Configures the logging module to record training progress, errors, and other relevant information.

```python
import logging

logging.basicConfig(level=logging.INFO)
```

--------------------------------

### Install string2string and Dependencies

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Installs the string2string library along with scikit-learn and networkx using pip. The '%%capture' magic command suppresses output.

```python
%%capture
!pip install string2string
!pip install scikit-learn
!pip install networkx
```

--------------------------------

### Data Preprocessing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates a basic data preprocessing step, likely tokenization or numericalization.
```python
from string2string.utils import preprocess_text

processed_data = preprocess_text(raw_data)
```

--------------------------------

### Install pytest

Source: https://github.com/stanfordnlp/string2string/blob/main/tests/README.md

Use this command to install or upgrade pytest to the latest version.

```bash
pip install -U pytest
```

--------------------------------

### Load Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates how to load data using the String2String library. Ensure the necessary libraries are installed.

```python
from string2string.utils.data import load_data

data = load_data("data.csv")
```

--------------------------------

### Example pytest execution output

Source: https://github.com/stanfordnlp/string2string/blob/main/tests/README.md

This is an example of the output you should expect when running the pytest command in the project's test directory. It shows the test session starting, collected items, progress, and a summary of passed tests.

```python
>>> pytest
============================= test session starts =============================
platform darwin -- Python 3.9.12, pytest-7.2.2, pluggy-1.0.0
rootdir: /Users/machine/string2string
collected 15 items

test_alignment.py .......                                                [ 46%]
test_distance.py .....                                                   [ 80%]
test_rogue.py .                                                          [ 86%]
test_sacrebleu.py .                                                      [ 93%]
test_search.py .                                                         [100%]

============================= 15 passed in 6.05s ==============================
```

--------------------------------

### String to String Model Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates a typical inference process using a pre-trained string-to-string model.
It takes an input string, tokenizes it, generates an output, and decodes the output back into a string.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "translate English to French: Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Load Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates how to load data using the String2String library. Ensure the necessary libraries are installed before running.

```python
from string2string.dataset import Dataset

dataset = Dataset.load("data/train.tsv")
```

--------------------------------

### String2String Configuration Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Illustrates how to configure the String2String model with various parameters. Adjust these settings based on your specific task and hardware.

```python
from string2string.model import Model

config = Model.get_config(
    model_name="model",
    model_dir="model_dir",
    batch_size=128,
    epochs=10,
    learning_rate=0.0001,
    max_seq_length=128,
    warmup_steps=1000,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    logging_steps=100,
    save_steps=1000,
    eval_steps=1000,
    no_cuda=False,
    seed=42,
    fp16=False,
    fp16_opt_level="O1",
    local_rank=-1,
    server_ip='',
    server_port=''
)
```

--------------------------------

### Initialize Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Initializes a model with specified parameters. This is a common setup for many NLP tasks.
```python
from string2string.model import Model

model = Model(d_model=512, num_layers=6, num_heads=8, dff=2048, vocab_size=30000, dropout=0.1)
```

--------------------------------

### Batch Processing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Illustrates how to process data in batches, a common practice for efficient training of deep learning models.

```python
for batch, (inp, tar) in enumerate(dataset):
    # Process each batch (placeholder body; the source does not show it)
    pass
```

--------------------------------

### Load and Process Data with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates loading a dataset and performing basic preprocessing steps. Ensure the 'datasets' library is installed. The snippet assumes a `tokenizer` has already been created.

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="/stanfordnlp/string2string/blob/main/data/multi_news.jsonl")

def preprocess_function(examples):
    inputs = [ex["article"] for ex in examples["input_data"]]
    targets = [ex["summary"] for ex in examples["target_data"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
```

--------------------------------

### Another String to String Model Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet provides another example of using a pre-trained string-to-string model for inference, similar to the previous ones but with a different input text.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "summarize: The Orbiter Discovery is scheduled to launch on Tuesday, August 9, 2005, at 4:00 PM EDT from Kennedy Space Center in Florida."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Checkpoint Manager Setup

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Sets up a checkpoint manager to save and restore model weights during training. This is crucial for resuming training and preventing loss of progress. The snippet assumes `model` and `optimizer` are already defined.

```python
import tensorflow as tf

checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
```

--------------------------------

### Train the Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Starts the model training process using the configured Trainer. This may take a significant amount of time.

```python
trainer.train()
```

--------------------------------

### Importing Libraries

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Imports necessary libraries for text processing and comparison. Ensure these are installed before running.
```python
import pandas as pd
import numpy as np
import difflib
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk

nltk.download('punkt')
nltk.download('stopwords')
```

--------------------------------

### Basic String2String Usage

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates the fundamental usage of the String2String library for basic string transformations. Ensure the library is installed before running.

```python
from string2string import String2String

# Initialize the String2String object
s2s = String2String()

# Example usage
input_string = "Hello, world!"
output_string = s2s.transform(input_string)

print(f"Input: {input_string}")
print(f"Output: {output_string}")
```

--------------------------------

### Load and Display Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Loads a dataset and displays its first few rows. Ensure the 'pandas' library is installed and imported.

```python
import pandas as pd

df = pd.read_csv('/kaggle/input/plagiarism-detection-dataset/train.csv')
df.head()
```

--------------------------------

### Tokenization Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Shows how to tokenize text using the project's tokenizer. Tokenization is a crucial step in preparing text for machine learning models.

```python
from string2string.tokenizer import String2StringTokenizer

tokenizer = String2StringTokenizer.load("path/to/tokenizer")

text = "Tokenize this sentence."
tokens = tokenizer.tokenize(text)
print(tokens)
```

--------------------------------

### Plagiarism Detection Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates a basic plagiarism detection check by comparing similarity scores.
This is a simplified example.

```python
from difflib import SequenceMatcher

def plagiarism_check(text1, text2):
    ratio = SequenceMatcher(None, text1, text2).ratio()
    return ratio

text_original = "This is the original document content."
text_suspected = "This is the original document content, with some minor changes."

similarity_score = plagiarism_check(text_original, text_suspected)
print(f"Similarity Score: {similarity_score}")

if similarity_score > 0.8:
    print("Potential plagiarism detected.")
else:
    print("No significant plagiarism detected.")
```

--------------------------------

### String2String Model for Translation

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates translation using a String2String model. The input prompt specifies the source and target languages.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to German: The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Initialize FaissSearch with OPT-125M Model

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Downloads and initializes the FaissSearch tool with the specified Hugging Face model. Ensure the transformers library is installed.
```python
from string2string.search import FaissSearch

# Let's download OPT-125M from Facebook using HuggingFace's transformers library
model_name = 'facebook/opt-125m'
faiss_search = FaissSearch(model_name_or_path = model_name)
```

--------------------------------

### Detokenization Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates detokenizing a list of tokens back into a human-readable string. This is the inverse operation of tokenization.

```python
from string2string.tokenizer import String2StringTokenizer

tokenizer = String2StringTokenizer.load("path/to/tokenizer")

tokens = ["token1", "token2", "token3"]
text = tokenizer.detokenize(tokens)
print(text)
```

--------------------------------

### Load and Use a Pre-trained Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained string-to-string model and use it for inference. Ensure you have the necessary libraries installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to French: How are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Load and Prepare Patent Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

This snippet loads patent data and prepares it for further processing. Ensure the 'pandas' library is installed.
```python
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/data/USPTO/patent_data.csv")
df.head()
```

--------------------------------

### Fine-tune a Model on a Custom Dataset

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example demonstrates fine-tuning a pre-trained model on a custom dataset. It involves preparing the dataset and using the Trainer API. Make sure your dataset is in the correct format.

```python
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess_function(examples):
    inputs = [ex for ex in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

args = Seq2SeqTrainingArguments(
    "./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
```

--------------------------------

### Load and Use a Pre-trained String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained model and use it for inference.
Ensure you have the necessary libraries installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Initialize Plagiarism Detector

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Initializes the plagiarism detector with a specific algorithm. This setup is required before processing any essays.

```python
from plagiarism_detector import PlagiarismDetector

plagiarism_detector = PlagiarismDetector('lcs')
```

--------------------------------

### Load and Prepare Patent Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data from a specified source and prepares it for further processing. Ensure the 'pandas' library is installed.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("/kaggle/input/uspto-patent-abstracts/USPTO_patent_abstracts.csv")

# Display the first 5 rows
df.head()
```

--------------------------------

### Initialize Plagiarism Detector

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Initializes a plagiarism detection model using the 'plagiarism_detector' library. This setup is necessary before performing any detection tasks.
```python
from plagiarism_detector import PlagiarismDetector

plagiarism_detector = PlagiarismDetector()
plagiarism_detector.load_model()
```

--------------------------------

### String2String Model with Batch Inference

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example illustrates how to perform inference on a batch of inputs using the String2String model. This is more efficient for processing multiple sequences.

```python
from string2string import String2String

model = String2String.from_pretrained("path/to/your/model")

inputs = ["input text 1", "input text 2"]
results = model.predict_batch(inputs)
print(results)
```

--------------------------------

### String2String Model for Question Answering

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This example shows how to use a String2String model for question answering. The input format includes the question and the context.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "What is the capital of France?"
context = "France is a country in Europe. Its capital is Paris."

inputs = tokenizer(f"question: {question} context: {context}", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### Inference Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Performs inference using the trained model to generate output for a given input. This is typically done after training is complete.
```python
result, _ = model.translate(sentence)
```

--------------------------------

### Load and Use a Pre-trained Encoder-Decoder Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Loads a generic encoder-decoder model and tokenizer. This is a foundational example for many sequence-to-sequence tasks.

```python
from transformers import EncoderDecoderModel, AutoTokenizer

model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

--------------------------------

### Python String Parsing Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates basic string parsing in Python. This snippet is useful for extracting information from structured strings.

```python
import sys

def main():
    # Example usage of string parsing
    # This part of the code is not directly shown in the provided snippet but is implied by the context.
    pass

if __name__ == "__main__":
    main()
```

--------------------------------

### String2String Utility Function Example

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Illustrates the usage of a utility function within the String2String library. This function performs a specific text transformation.

```python
from string2string.utils.text import transform_text

processed_text = transform_text("sample text")
```

--------------------------------

### String2String Utility: Text Style Transfer (Few-Shot Prompting)

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Applies few-shot prompting for style transfer, providing examples within the prompt to guide the model.
```python
from string2string.utils.style_transfer import few_shot_prompt_style_transfer

styled_text = few_shot_prompt_style_transfer(llm_model, "prompt with examples", "text")
```

--------------------------------

### Load and Use a Pre-trained String-to-String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

This snippet shows how to load a pre-trained model and tokenizer from Hugging Face Transformers and use it for inference. Ensure you have the 'transformers' library installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = "translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

--------------------------------

### String2String with Custom Configuration

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to initialize String2String with custom configurations for specific transformation needs. This allows for fine-tuning the library's behavior.

```python
from string2string import String2String

# Define custom configuration parameters
config = {
    "model_name": "my_custom_model",
    "max_length": 100,
    "temperature": 0.7
}

# Initialize String2String with custom configuration
s2s_custom = String2String(config=config)

# Example usage with custom configuration
input_string = "Another example string."
output_string = s2s_custom.transform(input_string)

print(f"Input: {input_string}")
print(f"Output: {output_string}")
```

--------------------------------

### compute_multi_ref_score

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/metrics.md

Scores a batch of examples with multiple references.
```APIDOC
## compute_multi_ref_score

### Description
Scores a batch of examples with multiple references.

### Parameters
* **source_sentences** (List[str]) - The source sentences.
* **target_sentences** (List[List[str]]) - The target sentences.
* **agg** (str) - The aggregation method. Can be “mean” or “max”.
* **batch_size** (int) - The batch size.

### Returns
The BARTScore for each example.

### Return type
Dict[str, List[float]]

### Raises
**ValueError** - If the number of source sentences and target sentences do not match.
```

--------------------------------

### BARTScore.compute_multi_ref_score

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/similarity.md

Scores a batch of examples with multiple references using BARTScore.

```APIDOC
## BARTScore.compute_multi_ref_score

### Description
Score a batch of examples with multiple references.

### Parameters
* **source_sentences** (List[str]) - The source sentences.
* **target_sentences** (List[List[str]]) - The target sentences.
* **agg** (str) - The aggregation method. Can be “mean” or “max”.
* **batch_size** (int) - The batch size.

### Returns
The BARTScore for each example.

### Return type
Dict[str, List[float]]

### Raises
**ValueError** - If the number of source sentences and target sentences do not match.
```

--------------------------------

### Define Training Arguments

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Sets up training arguments for the model. This includes output directory, learning rate, and number of epochs.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/string2string/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)
```

--------------------------------

### String2String Model with Custom Configuration

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Demonstrates loading a String2String model with custom configuration parameters. This allows for fine-tuning the model's behavior.

```python
from string2string import String2String
from string2string.config import String2StringConfig

config = String2StringConfig.from_pretrained("path/to/your/config")
model = String2String.from_pretrained("path/to/your/model", config=config)

result = model.predict("input text")
print(result)
```

--------------------------------

### Initialize String2String Model and Tokenizer

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Shows how to load a pre-trained model and its corresponding tokenizer for sequence-to-sequence tasks. This is a prerequisite for training or inference.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

--------------------------------

### String2String Utility: Text Adversarial Attack

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Generates adversarial examples for text to test model robustness.
```python
from string2string.utils.adversarial import generate_adversarial_text

adv_text = generate_adversarial_text("This is a normal sentence.")
```

--------------------------------

### Initialize and Use String2String Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Demonstrates how to initialize and use a String2String model for text processing. This is a foundational step for many tasks within the project.

```python
from string2string.model import String2String

model = String2String("path/to/your/model")
result = model.predict("This is a sample text.")
print(result)
```

--------------------------------

### Get Plagiarism Results

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb

Retrieves the plagiarism detection results. This function returns a list of detected plagiarism instances.

```python
results = plagiarism_detector.get_results()
results
```

--------------------------------

### Prepare Data for Plotly

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

This snippet calls the prepare_plotly_data function to get the necessary data structures for Plotly visualization.

```python
# Let's prepare the data for plotly
tsne_coords, tsne_labels, tsne_titles, tsne_hover_texts = prepare_plotly_data(
    tsne_embeddings, patent_titles, patent_ipc_subclass_labels, most_common_labels)
```

--------------------------------

### Load and Prepare Data

Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb

Loads patent data and prepares it for semantic search. Ensure the 'data' directory exists and contains the necessary patent files.
```python
import os
import pandas as pd

# Load the dataset
df = pd.read_csv("/data/patent.csv")

# Display the first 5 rows
df.head()
```

--------------------------------

### Initialize and Run Plagiarism Detection

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

This snippet shows how to initialize and run a plagiarism detection process. It involves setting up parameters and executing the detection.

```python
from string2string.plagiarism_detection import PlagiarismDetector

# Initialize the detector
detector = PlagiarismDetector(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,
    batch_size=16,
    device="cpu",
)

# Run plagiarism detection
# Replace with your actual text data
text_data = {
    "doc1": "This is the first document.",
    "doc2": "This is the second document, which is similar to the first.",
    "doc3": "This is a completely different document.",
}

data = detector.run_plagiarism_detection(text_data)
print(data)
```

--------------------------------

### Plagiarism Detection with String2String

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb

Example of using the String2String model for plagiarism detection. This snippet assumes a model fine-tuned for this task.

```python
from string2string.model import String2String

plagiarism_model = String2String.load("path/to/plagiarism/model")

text1 = "Original text content."
text2 = "Slightly modified version of the original text."

# Assuming the model outputs a score or a classification
result = plagiarism_model.predict(text1, text2)
print(f"Plagiarism score: {result}")
```

--------------------------------

### Generate Text with T5 Model

Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb

Generates text using a T5 model. This example shows a simple translation task.

```python
input_text = "translate English to German: That is good."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids outputs = model.generate(input_ids) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` -------------------------------- ### Initialize FaissSearch and Corpus Source: https://github.com/stanfordnlp/string2string/blob/main/README.md Instantiate the FaissSearch class with a specified Hugging Face model and initialize the corpus for semantic search. ```python >>> # Let's create a FaissSearch class instance from the search module to perform semantic search >>> from string2string.search import FaissSearch >>> faiss_search = FaissSearch(model_name_or_path = 'facebook/bart-large') >>> # Let's create a corpus of strings (e.g., sentences) >>> corpus = { 'text': [ "Coffee is my go-to drink in the morning.", "I always try to make time for exercise.", "Learning something new every day keeps me motivated.", "The sunsets in my hometown are breathtaking.", "I am grateful for the support of my friends and family.", "The book I'm reading is incredibly captivating.", "I love listening to music while I work.", "I'm excited to try the new restaurant in town.", "Taking a walk in nature always clears my mind.", "I believe that kindness is the most important trait.", "It's important to take breaks throughout the day.", "I'm looking forward to the weekend.", "Reading before bed helps me relax.", "I try to stay positive even in difficult situations.", "Cooking is one of my favorite hobbies.", "I'm grateful for the opportunity to learn and grow every day.", "I love traveling and experiencing new cultures.", "I'm proud of the progress I've made so far.", "A good night's sleep is essential for my well-being.", "Spending time with loved ones always brings me joy.", "I'm grateful for the beauty of nature around me.", "I try to live in the present moment and appreciate what I have.", "I believe that honesty is always the best policy.", "I enjoy challenging myself and pushing my limits.", "I'm excited to see what the 
future holds." ], } >>> # Next we need to initialize and encode the corpus >>> faiss_search.initialize_corpus( corpus=corpus, section='text', embedding_type='mean_pooling', ) ``` -------------------------------- ### Python String Manipulation Example Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Illustrates string manipulation techniques in Python. This is helpful for transforming and cleaning string data. ```python import sys def main(): # Example usage of string manipulation # This part of the code is not directly shown in the provided snippet but is implied by the context. pass if __name__ == "__main__": main() ``` -------------------------------- ### Load and Prepare Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb Loads patent data from a specified path and prepares it for further processing. Ensure the 'data' directory exists and contains the necessary files. ```python import json import os import pandas as pd data_path = "/stanfordnlp/string2string" def load_data(data_path): """Load every .jsonl file under data_path into one DataFrame.""" data = [] for filename in os.listdir(data_path): if filename.endswith(".jsonl"): file_path = os.path.join(data_path, filename) with open(file_path, "r") as f: for line in f: # Parse each line as JSON; eval() would execute arbitrary code and fails on true/false/null data.append(json.loads(line)) return pd.DataFrame(data) df = load_data(data_path) df.head() ``` -------------------------------- ### Initialize GloVe Embeddings Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md Instantiate the GloVeEmbeddings class with specified model, dimension, and download options. The model will be downloaded automatically if not found.
```python from string2string.misc.word_embeddings import GloVeEmbeddings # Initialize with default parameters glove_embeddings = GloVeEmbeddings() # Initialize with a specific model and dimension glove_embeddings_custom = GloVeEmbeddings(model='glove.twitter.27B', dim=100) # Force download if the model already exists glove_embeddings_force_download = GloVeEmbeddings(force_download=True) ``` -------------------------------- ### String2String Model with Custom Generation Parameters Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Demonstrates how to control the generation process by setting parameters like `max_length` and `num_beams`. This allows for more tailored output. ```python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer model_name = "google/flan-t5-small" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) inputs = tokenizer("Summarize: The quick brown fox jumps over the lazy dog.", return_tensors="pt") outputs = model.generate(inputs["input_ids"], max_length=50, num_beams=4, early_stopping=True) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` -------------------------------- ### Load and Prepare Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb Loads patent data from a specified path and prepares it for further processing. Ensure the 'data_path' variable points to your dataset. ```python import pandas as pd data_path = "/content/drive/MyDrive/data/patents.csv" df = pd.read_csv(data_path) df.head() ``` -------------------------------- ### Load and Prepare Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb Loads patent data from a specified path and prepares it for further processing. 
Ensure the 'data_path' variable points to your dataset. ```python import pandas as pd data_path = "/content/drive/MyDrive/data/USPTO_patent_abstracts.csv" df = pd.read_csv(data_path) df.head() ``` -------------------------------- ### Summarize Text with Pegasus Model Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Generates a summary for a given text using a Pegasus model. This example focuses on abstractive summarization. ```python ARTICLE = """ Your text to summarize goes here. Pegasus is designed for abstractive summarization, meaning it can generate summaries that are not just extracts of the original text. """ inputs = tokenizer(ARTICLE, truncation=True, padding="longest", return_tensors="pt") summary_ids = model.generate(inputs.input_ids, num_beams=1, max_length=100) print(tokenizer.decode(summary_ids[0], skip_special_tokens=True)) ``` -------------------------------- ### Configuration Loading Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Loads configuration parameters from a JSON file. This is a common way to manage hyperparameters and settings. ```python import json with open('config.json', 'r') as f: config = json.load(f) ``` -------------------------------- ### Python String Utility Function Example Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Shows a utility function for string processing in Python. This can be used for common string-related tasks. ```python import sys def main(): # Example usage of string utility functions # This part of the code is not directly shown in the provided snippet but is implied by the context. pass if __name__ == "__main__": main() ``` -------------------------------- ### Initialize Rabin-Karp Search Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb Import and initialize the RabinKarpSearch class, passing the configured hash function. 
This sets up the search object for pattern matching. ```python from string2string.search import RabinKarpSearch rabin_karp = RabinKarpSearch( hash_function=rolling_hash, ) ``` -------------------------------- ### BoyerMooreSearch.__init__() Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md Initializes the Boyer-Moore search algorithm class. The Boyer-Moore search algorithm is a string searching algorithm that uses a heuristic to skip over large sections of the search string, resulting in faster search times than traditional algorithms. ```APIDOC ## BoyerMooreSearch.__init__() ### Description Initializes the Boyer-Moore search algorithm class. ### Method __init__ ### Parameters None ### Returns None ``` -------------------------------- ### Load and Prepare Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Semantic Search and Visualization of Abstract Sections of USPTO Patents.ipynb Loads patent data from a specified path and prepares it for further processing. Ensure the 'data_path' variable points to the correct location of your patent data. ```python import os import pandas as pd data_path = "/content/drive/MyDrive/data/USPTO/patent_data.csv" def load_data(path): if not os.path.exists(path): raise FileNotFoundError(f"Data not found at {path}") df = pd.read_csv(path) return df df = load_data(data_path) df.head() ``` -------------------------------- ### String2String Utility: Get Model Configuration Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Retrieves the configuration details of the String2String model. Useful for understanding model architecture and hyperparameters. ```python from string2string.utils.config import get_config config = get_config() print(config) ``` -------------------------------- ### GloVeEmbeddings Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md This class implements the GloVe word embeddings. 
It can be initialized with various models and dimensions, and can be used to get embeddings for given tokens. ```APIDOC ## class string2string.misc.word_embeddings.GloVeEmbeddings ### Description This class implements the GloVe word embeddings. ### Parameters * **model** (str) - Optional - The model to use. Default is ‘glove.6B.200D’. (Options are: ‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’.) * **dim** (int) - Optional - The dimension of the embeddings. Default is 300. * **force_download** (bool) - Optional - Whether to force download the model. Default is False. * **dir** (str) - Optional - The directory to save or load the model. Default is None. * **tokenizer** (Tokenizer) - Optional - The tokenizer to use. Default is None. ### Methods #### __call__(tokens: List[str] | str) -> Tensor This function returns the embeddings of the given tokens. * **Parameters:** **tokens** (Union[List[str], str]) – The tokens to embed. * **Returns:** The embeddings of the given tokens. * **Return type:** Tensor #### get_embedding(tokens: List[str] | str) -> Tensor This function returns the embeddings of the given tokens. * **Parameters:** **tokens** (Union[List[str], str]) – The tokens to embed. * **Returns:** The embeddings of the given tokens. * **Return type:** Tensor ### Raises **ValueError** – If the model is not in the MODEL_OPTIONS [‘glove.6B.200D’, ‘glove.twitter.27B’, ‘glove.42B.300d’, ‘glove.840B.300d’]. ### Notes * If directory is None, the model will be saved in the torch hub directory. * If the model is not downloaded, it will be downloaded automatically. ``` -------------------------------- ### Load and Use a Pre-trained Model Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Demonstrates loading a pre-trained model for sentence embeddings. This is a foundational step for many NLP tasks.
```python from sentence_transformers import SentenceTransformer # Load a pre-trained model model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # Example usage: encode sentences sentences = [ 'This is the first sentence.', 'This is the second sentence.' ] embeddings = model.encode(sentences) print(embeddings) ``` -------------------------------- ### Initialize FaissSearch Source: https://github.com/stanfordnlp/string2string/blob/main/[Tutorial] Search.ipynb Import and initialize the FaissSearch class with a specified model and tokenizer. This sets up the core object for semantic search operations. ```python # Import the FaissSearch class from the search module from string2string.search import FaissSearch # Initialize the FaissSearch class faiss_search = FaissSearch( model_name_or_path = 'facebook/bart-large', tokenizer_name_or_path = 'facebook/bart-large', ) ``` -------------------------------- ### Load and Process Essay Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb This snippet shows how to load and preprocess essay data for plagiarism detection. Ensure the 'pandas' library is installed. ```python import pandas as pd essays = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Plagiarism Detection of Essays/essays.csv") essays.head() ``` -------------------------------- ### Train String2String Model Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Illustrates the process of training a String2String model. This typically involves providing training data and configuration. 
```python from string2string.model import String2String model = String2String() model.train("path/to/training/data", "path/to/save/model") print("Model trained and saved.") ``` -------------------------------- ### Initialize Similarity Metrics Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb Initializes instances of LongestCommonSubsequence and LongestCommonSubstring for similarity calculations. ```python # Initialize the similarity classes lcsubseq = LongestCommonSubsequence() lcsubstr = LongestCommonSubstring() ``` -------------------------------- ### Load and Display Data Source: https://github.com/stanfordnlp/string2string/blob/main/[Hands_On_Tutorial] Plagiarism Detection of Essays.ipynb Loads a dataset and displays the first few rows. Ensure the 'data.csv' file is in the same directory. ```python import pandas as pd df = pd.read_csv('data.csv') df.head() ``` -------------------------------- ### BoyerMooreSearch.search() Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md Searches for a pattern within a given text using the Boyer-Moore algorithm. Returns the starting index of the first occurrence or -1 if not found. ```APIDOC ## BoyerMooreSearch.search(pattern: str, text: str) ### Description This function searches for the pattern in the text using the Boyer-Moore algorithm. ### Method search ### Parameters #### Path Parameters - **pattern** (str) - Required - The pattern to search for. - **text** (str) - Required - The text to search in. ### Returns The index of the pattern in the text (or -1 if the pattern is not found). ### Return type int ### Raises **AssertionError** – If the text or the pattern is not a string. ``` -------------------------------- ### String2String Model for Text Generation Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb This snippet illustrates using a String2String model for general text generation. 
You can provide a prompt to guide the output. ```python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer model_name = "google/flan-t5-small" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSeq2SeqLM.from_pretrained(model_name) inputs = tokenizer("Write a short story about a robot: Once upon a time,", return_tensors="pt") outputs = model.generate(**inputs, max_length=100, num_return_sequences=1) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` -------------------------------- ### FaissSearch Initialization Source: https://github.com/stanfordnlp/string2string/blob/main/docs/matching.md Initializes the wrapper for the FAISS library, used for semantic search. It allows specifying the model, tokenizer, and device. ```APIDOC ## FaissSearch ### Description Initializes the wrapper for the FAISS library, which is used to perform semantic search. ### Method __init__ ### Parameters * **model_name_or_path** (str, optional) – The name or path of the model to use. Defaults to ‘facebook/bart-large’. * **tokenizer_name_or_path** (str, optional) – The name or path of the tokenizer to use. Defaults to ‘facebook/bart-large’. * **device** (str, optional) – The device to use. Defaults to ‘cpu’. ### Returns None ``` -------------------------------- ### Load Data for String to String Models Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Loads data from a specified path for training string-to-string models. Ensure the path points to a valid dataset. 
```python from google.colab import auth auth.authenticate_user() from google.cloud import storage client = storage.Client(project='string2string') bucket = client.get_bucket('string2string') blob = bucket.blob('plagiarism_detection.ipynb') blob.download_to_filename('plagiarism_detection.ipynb') ``` -------------------------------- ### Custom Layer Example Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Shows the definition of a custom layer within a TensorFlow/Keras model. This layer might implement specific attention mechanisms or transformations. ```python import tensorflow as tf class CustomLayer(tf.keras.layers.Layer): def __init__(self, **kwargs): super(CustomLayer, self).__init__(**kwargs) def build(self, input_shape): # Define layer weights here super(CustomLayer, self).build(input_shape) def call(self, inputs): # Define forward pass logic here return inputs ``` -------------------------------- ### Perform Inference with String2String Model Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Shows how to use a pre-trained String2String model for inference. This requires a model file and input data. ```python from string2string.model import String2String model = String2String("model.pth") result = model.predict("input text") ``` -------------------------------- ### Generate Text with Encoder-Decoder Model Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Generates text using a configured encoder-decoder model. This setup allows combining different encoder and decoder architectures. ```python from transformers import AutoTokenizer # Assuming 'model' and 'tokenizer' are loaded as in the previous snippet input_text = "This is an example input."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids # Set the decoder start token ID on the model config if needed (e.g., for GPT-2 decoder); setting it on the tokenizer has no effect on generation model.config.decoder_start_token_id = tokenizer.cls_token_id outputs = model.generate(input_ids) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` -------------------------------- ### String2String Model with Custom Generation Parameters Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Shows how to customize generation parameters for the String2String model, such as `max_length`, `num_beams`, and `temperature`. These parameters control the output sequence generation. ```python from string2string import String2String model = String2String.from_pretrained("path/to/your/model") result = model.predict("input text", max_length=50, num_beams=4, temperature=0.7) print(result) ``` -------------------------------- ### KMP Search Algorithm Implementation Source: https://github.com/stanfordnlp/string2string/blob/main/README.md Demonstrates how to use the KMPSearch class to find the index of a pattern within a text using the KMP algorithm. Ensure the KMPSearch class is imported from string2string.search. ```python >>> # Let's create a KMPSearch class instance from the search module >>> from string2string.search import KMPSearch >>> knuth_morris_pratt = KMPSearch() >>> # Let's define a pattern and a text >>> pattern = 'Jane Austen' >>> text = 'Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey, Persuasion, and Lady Susan were written by Jane Austen and are important works of English literature.' >>> # Now let's find the index of the pattern in the text, if it exists (otherwise, -1 is returned). >>> idx = knuth_morris_pratt.search(pattern=pattern,text=text) >>> print(f'The index of the pattern in the text is {idx}.') # The index of the pattern in the text is 127.
``` -------------------------------- ### get_alignment Source: https://github.com/stanfordnlp/string2string/blob/main/docs/alignment.md Gets the alignment of two strings (or list of strings) using the Hirschberg algorithm. This method provides a space-efficient solution with O(nm) time complexity. ```APIDOC ## get_alignment ### Description Gets the alignment of two strings (or list of strings) by using the Hirschberg algorithm. ### Parameters * **str1** - The first string (or list of strings). * **str2** - The second string (or list of strings). ### Returns The aligned strings as a tuple of two strings (or list of strings). ``` -------------------------------- ### Get Embeddings for Tokens Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md Retrieves the word embeddings for a given list of tokens or a single string of tokens. The function returns a Tensor containing the embeddings. ```APIDOC ## __call__(tokens: List[str] | str) -> Tensor ### Description This function returns the embeddings of the given tokens. ### Parameters * **tokens** (Union[List[str], str]) – The tokens to embed. ### Returns The embeddings of the given tokens. ### Return type Tensor ``` ```APIDOC ## get_embedding(tokens: List[str] | str) -> Tensor ### Description This function returns the embeddings of the given tokens. ### Parameters * **tokens** (Union[List[str], str]) – The tokens to embed. ### Returns The embeddings of the given tokens. ### Return type Tensor ``` -------------------------------- ### Train a Model with String2String Source: https://github.com/stanfordnlp/string2string/blob/main/docs/hupd_example.ipynb Shows the process of training a model using the String2String library. This involves specifying model parameters and the dataset. 
```python from string2string.model import Model model = Model.train( dataset=dataset, model_name="model", model_dir="model_dir", batch_size=128, epochs=10, learning_rate=0.0001, max_seq_length=128, warmup_steps=1000, gradient_accumulation_steps=1, weight_decay=0.01, adam_epsilon=1e-8, max_grad_norm=1.0, logging_steps=100, save_steps=1000, eval_steps=1000, no_cuda=False, seed=42, fp16=False, fp16_opt_level="O1", local_rank=-1, server_ip='', server_port='' ) ``` -------------------------------- ### Get Word Embeddings Source: https://github.com/stanfordnlp/string2string/blob/main/docs/embedding.md Retrieve the vector representations for a list of tokens or a single string of tokens. The class handles tokenization internally if a string is provided. ```python from string2string.misc.word_embeddings import GloVeEmbeddings glove_embeddings = GloVeEmbeddings() # Get embeddings for a list of tokens embeddings_list = glove_embeddings(['hello', 'world']) # Get embeddings for a single string of tokens embeddings_string = glove_embeddings('this is a test') # The get_embedding method can also be used directly embeddings_direct = glove_embeddings.get_embedding(['another', 'example']) print(embeddings_list.shape) print(embeddings_string.shape) print(embeddings_direct.shape) ``` -------------------------------- ### String2String Utility: Model Loading from Checkpoint Source: https://github.com/stanfordnlp/string2string/blob/main/docs/plagiarism_detection.ipynb Loads a model's state, including optimizer state and epoch number, from a saved checkpoint. ```python from string2string.utils.checkpointing import load_checkpoint model, optimizer, epoch = load_checkpoint("model_checkpoint.pt") ```
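--------------------------------

### KMP Search: Plain-Python Sketch

The `KMPSearch` example earlier in this section reports the index of the first match, or -1 when the pattern is absent. As a rough illustration of the technique behind that call, the sketch below implements Knuth-Morris-Pratt matching in plain Python. It is not the library's implementation: `build_failure_table` and `kmp_search` are hypothetical names introduced here, and only the return convention (first index, or -1) is taken from the README example.

```python
from typing import List


def build_failure_table(pattern: str) -> List[int]:
    """For each position i, store the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Hypothetical helper for illustration; an empty pattern matches at 0,
    mirroring str.find.
    """
    if not pattern:
        return 0
    failure = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # Reuse the failure table instead of re-scanning earlier text.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1  # start index of the match
    return -1


if __name__ == "__main__":
    sample = "Persuasion and Lady Susan were written by Jane Austen."
    # The sketch should agree with Python's built-in substring search.
    print(kmp_search("Jane Austen", sample) == sample.find("Jane Austen"))
    print(kmp_search("Brontë", sample))
```

The failure table is what lets the scan move through the text without backing up, so the search runs in time linear in the combined length of pattern and text. Comparing against `str.find` is a convenient sanity check when adapting the sketch.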