### Install Libraries for Semantic Chunking

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Install the Python libraries needed for semantic chunking, vector embeddings, and nearest neighbor search. The `-q` flag keeps the installation output quiet.

```python
# Install the necessary libraries
!pip install -q datasets model2vec numpy tqdm vicinity "chonkie[semantic]"
```

--------------------------------

### Install Model2Vec Training Extras

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Install the extras needed to train classifiers with model2vec. The package spec is quoted so that shells such as zsh do not treat the square brackets as a glob pattern.

```bash
pip install "model2vec[train]"
```

--------------------------------

### Install and Import Libraries

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Installs the required Python packages and imports the libraries used for recipe search with Model2Vec.

```python
# Install the necessary libraries
!pip install numpy datasets scikit-learn transformers model2vec

# Import the necessary libraries
import regex
from collections import Counter

import numpy as np
from datasets import load_dataset
from sklearn.metrics import pairwise_distances
from tokenizers.pre_tokenizers import Whitespace

from model2vec import StaticModel
from model2vec.distill import distill
```

--------------------------------

### Install model2vec

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/model_card_template.md

Install the model2vec library using pip. This is the first step to using the library.

```bash
pip install model2vec
```

--------------------------------

### Install Model2Vec with Distillation Extras

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Install the `distill` extras for Model2Vec to enable custom model distillation.
```bash
pip install "model2vec[distill]"
```

--------------------------------

### Install model2vec

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/classifier_template.md

Install the model2vec library with the inference extras using pip.

```bash
pip install "model2vec[inference]"
```

--------------------------------

### Install Model2Vec and Dependencies

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Install Model2Vec with the training and inference extras, along with datasets and scikit-learn. This is a prerequisite for the training functionality.

```python
# Install the necessary libraries
!uv pip install "model2vec[train,inference]"
!uv pip install "datasets"
!uv pip install "scikit-learn"
```

--------------------------------

### Model Summary and Training Setup

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Displays the model architecture and parameter count, and reports GPU availability. This output is typical after initializing a PyTorch Lightning Trainer.

```text
Seed set to 42
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated.
Please use get_last_lr() to access the learning rate.
  warnings.warn(

  | Name  | Type                         | Params | Mode
---------------------------------------------------------------
0 | model | StaticModelForClassification | 7.7 M  | train
---------------------------------------------------------------
7.7 M     Trainable params
0         Non-trainable params
7.7 M     Total params
30.922    Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode
```

--------------------------------

### Display Example Chunks

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Print a few randomly selected text chunks generated by Chonkie to inspect the results of the semantic chunking process.

```python
import random

# Print a few example chunks (`chunks` is the list produced by the chunker)
for _ in range(3):
    chunk = random.choice(chunks)
    print(chunk.text, "\n")
```

--------------------------------

### Load StaticModelPipeline

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Loads a StaticModelPipeline from a local directory or the Hugging Face Hub. Because the pipeline does not require PyTorch, it has fast cold starts.

```python
from model2vec.inference import StaticModelPipeline

# Load from a local directory
new_model = StaticModelPipeline.from_pretrained("my_cool_model")

# Or from the hub
# model = StaticModelPipeline.from_pretrained("my_org/my_model")
```

--------------------------------

### Train and Evaluate Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Train a classifier on a dataset and evaluate its performance. Assumes the `datasets` library is installed.
```python
import numpy as np
from datasets import load_dataset
from time import perf_counter

# Load the subj dataset
ds = load_dataset("setfit/subj")
train = ds["train"]
test = ds["test"]

# `classifier` is assumed to have been created earlier; it is not defined in this snippet
s = perf_counter()
classifier = classifier.fit(train["text"], train["label"])

print(f"Training took {int(perf_counter() - s)} seconds.")
# Training took 81 seconds
classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["label"])
print(classification_report)
# Achieved 91.0 test accuracy
```

--------------------------------

### Sanity Checking Progress

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Indicates the start of the sanity check phase for the model. This is a preliminary step before full training begins.

```text
Result: Sanity Checking: | | 0/? [00:00= min_length]
    # Recombine the filtered sentences
    return ' '.join(filtered_sentences)

# Preprocess the text
book_text = preprocess_text(book_text)
```

--------------------------------

### Distill a Custom Model2Vec Model

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Distill a Sentence Transformer model into a Model2Vec model. Distillation runs on a CPU in approximately 30 seconds, and the distilled model can then be saved.

```python
from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5")

# Save the model
m2v_model.save_pretrained("m2v_model")
```

--------------------------------

### Convert Classifier to Scikit-learn Pipeline

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Convert a trained classifier into a scikit-learn compatible pipeline object for easier integration and persistence.
```python
pipeline = classifier.to_pipeline()
```

--------------------------------

### Create Vocabulary from Texts

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Defines a function to create a vocabulary from a list of texts using a regex tokenizer and a counter. The vocabulary is sorted by token frequency.

```python
my_regex = regex.compile(r"\w+|[^\w\s]+")


def create_vocab(texts: list[str], tokenizer: Whitespace, size: int = 30_000) -> list[str]:
    """
    Create a vocab from a list of texts.

    :param texts: A list of texts.
    :param tokenizer: A whitespace tokenizer.
    :param size: The size of the vocab.
    :return: A vocab sorted by frequency.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenizer.pre_tokenize_str(text.lower())
        tokens = [token for token, _ in tokens]
        counts.update(tokens)

    vocab = [word for word, _ in counts.most_common(size)]
    return vocab
```

--------------------------------

### Export Model to Scikit-learn Pipeline

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Converts the trained model to a scikit-learn pipeline. This allows for consistent prediction and evaluation using standard scikit-learn tools.

```python
pipeline = model.to_pipeline()
predictions = pipeline.predict(dataset["test"]["text"])
print(classification_report(dataset["test"]["label_text"], predictions))
```

--------------------------------

### Save Model Locally

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Save the trained pipeline to a local directory. This is useful for later use or deployment.

```python
pipeline.save_pretrained("my_cool_model")
```

--------------------------------

### Similarity Search Function

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Defines a function to find the most similar recipe titles to a given query using a Model2Vec model and precomputed embeddings.
It computes cosine similarity against the precomputed embeddings and returns the top-k results as index and score pairs.

```python
# Define a function to find the most similar titles in a dataset to a given query
def find_most_similar_items(model: StaticModel, embeddings: np.ndarray, query: str, top_k: int = 5) -> list[tuple[int, float]]:
    """
    Finds the most similar items in a dataset to the given query using the specified model.

    :param model: The model used to generate embeddings.
    :param embeddings: The embeddings of the dataset.
    :param query: The query recipe title.
    :param top_k: The number of most similar titles to return.
    :return: A list of (index, cosine similarity) tuples for the most similar titles.
    """
    # Generate embedding for the query
    query_embedding = model.encode(query)[None, :]

    # Calculate pairwise cosine distances between the query and the precomputed embeddings
    distances = pairwise_distances(query_embedding, embeddings, metric='cosine')[0]

    # Get the indices of the most similar items (sorted in ascending order because smaller distances are better)
    most_similar_indices = np.argsort(distances)

    # Convert distances to similarity scores (cosine similarity = 1 - cosine distance)
    most_similar_scores = [1 - distances[i] for i in most_similar_indices[:top_k]]

    # Return the top-k most similar indices and similarity scores
    return list(zip(most_similar_indices[:top_k], most_similar_scores))
```

--------------------------------

### Evaluate Multi-label Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Evaluate a multi-label classifier on the test texts and labels, using a classification threshold to decide which labels count as predicted.
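To make the role of a classification threshold concrete, here is a minimal pure-Python sketch. It assumes the threshold is a cut-off applied to per-label probabilities; the label names, the probabilities, and the `apply_threshold` helper are all hypothetical, and whether the library compares with `>` or `>=` is not stated in this document.

```python
# Hypothetical label names and per-label probabilities for two texts.
LABELS = ["joy", "anger", "surprise"]


def apply_threshold(probabilities: list[list[float]], threshold: float) -> list[list[str]]:
    """Keep every label whose probability is at least `threshold` (assumed rule)."""
    return [
        [label for label, p in zip(LABELS, row) if p >= threshold]
        for row in probabilities
    ]


probabilities = [
    [0.62, 0.31, 0.05],  # two labels clear a 0.3 cut-off
    [0.10, 0.25, 0.28],  # no label clears a 0.3 cut-off
]

predictions = apply_threshold(probabilities, threshold=0.3)
print(predictions)  # [['joy', 'anger'], []]
```

Under this assumed rule, a lower threshold assigns more labels per text, which generally raises recall at the cost of precision.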
```python
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer

classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["labels"], threshold=0.3)
print(classification_report)
# Accuracy: 0.410
# Precision: 0.527
# Recall: 0.410
# F1: 0.439
```

--------------------------------

### Print Classification Report

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Generates a classification report to evaluate the model's performance on the test dataset. This includes precision, recall, F1-score, and support for each class.

```python
from sklearn.metrics import classification_report

predictions = model.predict(dataset["test"]["text"])
print(classification_report(dataset["test"]["label_text"], predictions))
```

--------------------------------

### Validation Progress Indicator

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Displays the progress bar for the validation phase, during which the model is evaluated on the validation set.

```text
Result: Validation: | | 0/? [00:00