### Install Libraries for Semantic Chunking

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Install the necessary Python libraries for semantic chunking, vector embeddings, and nearest neighbor search. Use '-q' for quiet installation.

```python
# Install the necessary libraries
!pip install -q datasets model2vec numpy tqdm vicinity "chonkie[semantic]"
```

--------------------------------

### Install Model2Vec Training Extras

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Install the necessary components for training classifiers with model2vec.

```bash
pip install "model2vec[train]"
```

--------------------------------

### Install and Import Libraries

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Installs necessary Python packages and imports required libraries for recipe search and Model2Vec functionality.

```python
# Install the necessary libraries
!pip install numpy datasets scikit-learn transformers model2vec
    
# Import the necessary libraries
import regex
from collections import Counter

import numpy as np
from datasets import load_dataset
from sklearn.metrics import pairwise_distances
from tokenizers.pre_tokenizers import Whitespace

from model2vec import StaticModel
from model2vec.distill import distill
```

--------------------------------

### Install model2vec

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/model_card_template.md

Install the model2vec library using pip. This is the first step to using the library.

```bash
pip install model2vec
```

--------------------------------

### Install Model2Vec with Distillation Extras

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Install the 'distillation' extras for Model2Vec to enable custom model distillation.

```bash
pip install "model2vec[distill]"
```

--------------------------------

### Install model2vec with Inference Extras

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/classifier_template.md

Install the model2vec library with inference capabilities using pip.

```bash
pip install "model2vec[inference]"
```

--------------------------------

### Install Model2Vec and Dependencies

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Install the necessary libraries for Model2Vec training and inference, along with datasets and scikit-learn. This is a prerequisite for using the training functionalities.

```python
# Install the necessary libraries
!uv pip install "model2vec[train,inference]"
!uv pip install "datasets"
!uv pip install "scikit-learn"
```

--------------------------------

### Model Summary and Training Setup

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Displays the model architecture, parameter count, and indicates GPU availability. This output is typical after initializing a PyTorch Lightning Trainer.

```text
Seed set to 42
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn(

  | Name  | Type                         | Params | Mode 
---------------------------------------------------------------
0 | model | StaticModelForClassification | 7.7 M  | train
---------------------------------------------------------------
7.7 M     Trainable params
0         Non-trainable params
7.7 M     Total params
30.922    Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode


```

--------------------------------

### Display Example Chunks

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Print a few randomly selected text chunks generated by Chonkie to inspect the results of the semantic chunking process.

```python
# Print a few example chunks
for _ in range(3):
    chunk = random.choice(chunks)
    print(chunk.text, "\n")
```

--------------------------------

### Load StaticModelPipeline

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Loads a StaticModelPipeline from a local directory or the Hugging Face Hub. This pipeline is optimized for fast cold starts as it does not require PyTorch.

```python
new_model = StaticModelPipeline.from_pretrained("my_cool_model")
# Or from the hub
# model = StaticModelPipeline.from_pretrained("my_org/my_model")
```

--------------------------------

### Train and Evaluate Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Train a classifier on a dataset and evaluate its performance. Assumes the 'datasets' library is installed and that `classifier` is an already initialized `StaticModelForClassification`.

```python
import numpy as np
from datasets import load_dataset
from time import perf_counter

# Load the subj dataset
ds = load_dataset("setfit/subj")
train = ds["train"]
test = ds["test"]

s = perf_counter()
classifier = classifier.fit(train["text"], train["label"])

print(f"Training took {int(perf_counter() - s)} seconds.")
# Training took 81 seconds
classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["label"])
print(classification_report)
# Achieved 91.0 test accuracy
```

--------------------------------

### Sanity Checking Progress

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Indicates the start of the sanity check phase for the model. This is a preliminary step before full training begins.

```text
Result:
Sanity Checking: |                                                                             | 0/? [00:00<?,…
```

--------------------------------

### Data Loading Warnings

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Provides warnings related to data loader configuration, suggesting improvements for performance by increasing the number of workers. This output appears during the setup phase of training.

```text
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
/Users/stephantulkens/Documents/GitHub/model2vec/.venv/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (29) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


```

--------------------------------

### Distill a Model2Vec Model

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/model_card_template.md

Distill a Model2Vec model from a Sentence Transformer model using the distill method. Ensure the 'distill' extra is installed with 'pip install model2vec[distill]'.

```python
from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```

--------------------------------

### Perform Recipe Similarity Search

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

This code snippet demonstrates how to encode recipe data into embeddings and then find the most similar recipes to a given query using the distilled Model2Vec model. It includes examples for searching 'cheeseburger' and 'fattoush'.

```python
# Find recipes using the output embeddings model
top_k = 5

# Find the most similar recipes to the given queries
query = "cheeseburger"
embeddings = model_custom.encode(recipes)

results = find_most_similar_items(model_custom, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_custom, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
```

--------------------------------

### Load and Prepare Recipe Dataset

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Loads the recipe dataset from Hugging Face and prepares the 'title' column for use as the recipe corpus.

```python
# Load the recipe dataset
dataset = load_dataset("Shengtao/recipe", split="train")
# Convert the dataset to a pandas DataFrame
dataset = dataset.to_pandas()
# Take the title column as our recipes corpus
recipes = dataset["title"]
```

--------------------------------

### Print First 5 Training Samples

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Iterates through the first 5 records of the training dataset and prints their text and label.

```python
for record in dataset["train"].to_list()[:5]:
    print(f"TEXT: {record['text']} LABEL: {record['label_text']}")
```

--------------------------------

### Initialize Classifier from Pre-trained Model

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Create a StaticModelForClassification instance from a pre-trained model, defaulting to 'minishlab/potion-base-32M'.

```python
from model2vec.train import StaticModelForClassification

# From a pre-trained model: potion is the default
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")
```

--------------------------------

### Display Dataset Head

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Shows the first few rows of the dataset, including title, category, description, ingredients, and directions, for a quick overview.

```python
# Display the first few rows of the dataset for the specified columns
dataset[["title", "category", "description", "ingredients", "directions"]].head()
```

--------------------------------

### Initialize Multi-label Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Initialize a classifier for multi-label classification from a pre-trained model.

```python
from datasets import load_dataset
from model2vec.train import StaticModelForClassification

# Initialize a classifier from a pre-trained model
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")

# Load a multi-label dataset
ds = load_dataset("google-research-datasets/go_emotions")

# Inspect some of the labels
print(ds["train"]["labels"][40:50])
# [[0, 15], [15, 18], [16, 27], [27], [7, 13], [10], [20], [27], [27], [27]]

# Train the classifier on text (X) and labels (y)
classifier.fit(ds["train"]["text"], ds["train"]["labels"])
```
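
Each entry in `labels` is a list of class indices, so a single text can carry several labels at once. As a rough illustration of that data shape (not model2vec's internal code), the sketch below turns such lists into fixed-length multi-hot vectors. The helper name `to_multi_hot` is hypothetical, and the class count of 28 is an assumption inferred only from the largest index (27) visible in the sample output above.

```python
def to_multi_hot(label_lists: list[list[int]], num_classes: int) -> list[list[int]]:
    """Turn lists of class indices into fixed-length 0/1 vectors."""
    vectors = []
    for labels in label_lists:
        row = [0] * num_classes
        for index in labels:
            row[index] = 1  # mark every class that applies to this example
        vectors.append(row)
    return vectors


# A few label lists copied from the sample output above
sample = [[0, 15], [15, 18], [27]]
encoded = to_multi_hot(sample, num_classes=28)  # 28 is an assumed class count

# Number of active labels per example
print([sum(row) for row in encoded])
```

An example with two indices yields a vector with two ones, which is what distinguishes multi-label data from single-label data where exactly one class is active.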

--------------------------------

### Load Pipeline from Hugging Face Hub

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Load a previously saved pipeline from the Hugging Face hub for inference. This method is optimized for speed, loading in approximately 30ms.

```python
from model2vec.inference import StaticModelPipeline

pipeline = StaticModelPipeline.from_pretrained("my_cool/project")
```

--------------------------------

### Initialize Chonkie Semantic Chunking

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Initialize the SDPMChunker from Chonkie using the 'minishlab/potion-base-32M' embedding model. Configure chunking parameters like chunk size, skip window, and minimum sentences.

```python
# Initialize an SDPMChunker from Chonkie with the potion-base-32M model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-32M",
    chunk_size = 512, 
    skip_window=5,
    min_sentences=3
)

# Chunk the text
time = perf_counter()
chunks = chunker.chunk(book_text)
print(f"Number of chunks: {len(chunks)}")
print(f"Time taken: {perf_counter() - time}")
```

--------------------------------

### Find Similar Recipes with Glove Model

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Encode recipes using the glove model and find similar recipes. This demonstrates the out-of-vocabulary problem for non-domain specific queries.

```python
top_k = 5

query = "cheeseburger"
embeddings = model_glove.encode(recipes)

results = find_most_similar_items(model_glove, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_glove, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
```

--------------------------------

### Import Libraries for Semantic Chunking and Vector Search

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Import required libraries including random, re, requests, time, Chonkie's SDPMChunker, Model2Vec's StaticModel, and Vicinity. Set a random seed for reproducibility.

```python
# Import the necessary libraries
import random 
import re
import requests
from time import perf_counter
from chonkie import SDPMChunker
from model2vec import StaticModel
from vicinity import Vicinity

random.seed(0)
```

--------------------------------

### Define and Print StaticModelForClassification

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Loads a pre-trained StaticModelForClassification and prints its architecture. Optional arguments allow customization of the model name, number of layers, and hidden dimensions.

```python
# Define the staticmodel
model = StaticModelForClassification.from_pretrained()
# Optional arguments:
# model_name: the name of the base model (defaults to potion-base-8m)
# n_layers: the number of layers in the MLP (defaults to 1)
# hidden_dim: the number of hidden units (defaults to 512)
print(model)
```

--------------------------------

### Load and Predict with StaticModelPipeline

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/inference/README.md

Load a pre-trained classifier from Hugging Face and use it to predict labels for given text. Ensure the model name is valid and accessible.

```python
from model2vec.inference import StaticModelPipeline

classifier = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
label = classifier.predict("Attitudes towards cattle in the Alps: a study in letting go.")
```

--------------------------------

### Create Custom Vocabulary and Distill Model2Vec

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

This code snippet shows how to initialize a tokenizer, create a custom vocabulary from recipe data, and then distill a Model2Vec model using a specified Sentence Transformer model and the custom vocabulary.

```python
model_name = "BAAI/bge-small-en-v1.5"
tokenizer = Whitespace()

# Create a custom vocab from the recipe titles
vocab = create_vocab(recipes, tokenizer)

# Distill a model2vec model using the Sentence Transformer model and the custom vocab
model_custom = distill(model_name=model_name, vocabulary=vocab, pca_dims=256)
```

--------------------------------

### Load M2V Output Model

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Load a pre-trained Model2Vec base output model from the HuggingFace hub for recipe embedding.

```python
model_name = "minishlab/M2V_base_output"
model_output = StaticModel.from_pretrained(model_name)
```

--------------------------------

### Import Model2Vec Training and Inference Classes

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Import the `StaticModelForClassification` for training and `StaticModelPipeline` for inference. These are the core classes for using Model2Vec's classification features.

```python
# Import the necessary libraries
from model2vec.train import StaticModelForClassification
from model2vec.inference import StaticModelPipeline
```

--------------------------------

### Save and Push Pipeline to Hugging Face Hub

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Persist your trained pipeline to the Hugging Face hub for easy sharing and deployment. Ensure you have the necessary authentication set up.

```python
pipeline.save_pretrained(path)
pipeline.push_to_hub("my_cool/project")
```

--------------------------------

### Train StaticModelForClassification on a Subset

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Selects the first 1000 records from the training dataset and trains the StaticModelForClassification model on this subset. It also measures and prints the training time.

```python
import time
# Fit the model on the first 1000 records
subset = dataset["train"].select(range(1000))
s = time.time()
model = model.fit(subset["text"], subset["label_text"])
print(f"training took {time.time() - s} seconds")
```

--------------------------------

### Find Similar Recipes with Output Model

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Encode recipes using the output model and find the most similar recipes to a given query. Requires pre-defined `recipes` list and `find_most_similar_items` function.

```python
top_k = 5

query = "cheeseburger"
embeddings = model_output.encode(recipes)

results = find_most_similar_items(model_output, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_output, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
```

--------------------------------

### Load Model2Vec and Predict

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/classifier_template.md

Load a pretrained Model2Vec model and use it for prediction. Ensure the StaticModelPipeline is imported.

```python
from model2vec.inference import StaticModelPipeline

# Load a pretrained Model2Vec model
model = StaticModelPipeline.from_pretrained("{{ model_name }}")

# Predict labels
predicted = model.predict(["Example sentence"])
```

--------------------------------

### Load the 20 Newsgroups Dataset

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Load the '20_newsgroups' dataset using the `datasets` library. This dataset is used to demonstrate the classifier training process.

```python
from datasets import load_dataset

dataset = load_dataset("setfit/20_newsgroups")
print(dataset)
```

--------------------------------

### Initialize Embedding Model and Encode Chunks

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Initialize the StaticModel from Model2Vec using the specified pre-trained model and encode the preprocessed chunk texts into embeddings. This prepares the text for vector search.

```python
# Initialize an embedding model and encode the chunk texts
time = perf_counter()
model = StaticModel.from_pretrained("minishlab/potion-base-32M")
chunk_texts = [chunk.text for chunk in chunks]
chunk_embeddings = model.encode(chunk_texts)
```

--------------------------------

### Load M2V Glove Model

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Load a pre-trained Model2Vec base glove model from the HuggingFace hub. This model is larger and may offer different performance characteristics.

```python
model_name = "minishlab/M2V_base_glove"
model_glove = StaticModel.from_pretrained(model_name)
```

--------------------------------

### Load and Use Model2Vec Model

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/model_card_template.md

Load a pretrained Model2Vec model using the StaticModel.from_pretrained method and compute text embeddings.

```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("{{ model_name }}")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
```

--------------------------------

### Train and Evaluate Classifier with Model2Vec

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Use this snippet to train a classifier on text data and evaluate its performance. Ensure the 'classifier' object is initialized and the dataset 'ds' is loaded.

```python
classifier.fit(ds["train"]["text"], ds["train"]["label"])
classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["label"])
```

--------------------------------

### Fine-tune a Classification Model

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Initialize a classifier from a pre-trained Model2Vec model and load a dataset for fine-tuning. Supports both single and multi-label classification datasets.

```python
import numpy as np
from datasets import load_dataset
from model2vec.train import StaticModelForClassification

# Initialize a classifier from a pre-trained model
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")

# Load a dataset. Note: both single and multi-label classification datasets are supported
ds = load_dataset("setfit/subj")
```

--------------------------------

### Load and Use Pre-trained Model

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Load a pre-trained Model2Vec model from the HuggingFace hub and generate embeddings for text. This is useful for tasks like text classification, retrieval, clustering, or building RAG systems.

```python
from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the potion-base-32M model)
model = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
```

--------------------------------

### Initialize Classifier from Distilled Model

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Create a StaticModelForClassification instance from a distilled model.

```python
from model2vec.distill import distill
from model2vec.train import StaticModelForClassification

# From a distilled model
distilled_model = distill("baai/bge-base-en-v1.5")
classifier = StaticModelForClassification.from_static_model(model=distilled_model)
```

--------------------------------

### Load and Use Sentence Transformer Model

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/modelcards/model_card_template.md

Load a pretrained model using the Sentence Transformers library and compute text embeddings.

```python
from sentence_transformers import SentenceTransformer

# Load a pretrained Sentence Transformer model
model = SentenceTransformer("{{ model_name }}")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
```

--------------------------------

### Push Pipeline to Hugging Face Hub

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Pushes the trained pipeline to the Hugging Face Hub. The call is commented out; replace `my_org/my_model` with your own organization and model name before running it.

```python
# Fill in your own org
# pipeline.push_to_hub("my_org/my_model")
```

--------------------------------

### Training Progress Indicator

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Shows the progress bar for the training phase. This is a visual indicator of how far the training has progressed through the dataset.

```text
Result:
Training: |                                                                                    | 0/? [00:00<?,…
```

--------------------------------

### Create Vicinity Index

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Instantiate a Vicinity index from pre-computed vector embeddings and corresponding text chunks. This is useful for setting up a searchable knowledge base.

```python
vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)
```

--------------------------------

### Download and Preprocess Book Text

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Download the text of 'War and Peace' from Project Gutenberg and preprocess it by removing newlines and filtering short sentences. Sentences are defined as text ending in '.', '!', or '?'.

```python
# URL for War and Peace on Project Gutenberg
url = "https://www.gutenberg.org/files/2600/2600-0.txt"

# Download the book
response = requests.get(url)
book_text = response.text

def preprocess_text(text: str, min_length: int = 5):
    """Basic text preprocessing function."""
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    sentences = re.findall(r'[^.!?]*[.!?]', text)
    # Filter out sentences shorter than the specified minimum length
    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]
    # Recombine the filtered sentences
    return ' '.join(filtered_sentences)

# Preprocess the text
book_text = preprocess_text(book_text)
```
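
To see what the filtering does in practice, here is a small self-contained run of the same logic on a made-up sample string (the sample text is an assumption for illustration, not the downloaded book). The function is repeated so the block runs without a network request.

```python
import re


def preprocess_text(text: str, min_length: int = 5) -> str:
    """Same logic as the tutorial function, repeated so this example runs on its own."""
    text = text.replace("\n", " ").replace("\r", " ")
    # A "sentence" is any run of characters ending in '.', '!' or '?'
    sentences = re.findall(r"[^.!?]*[.!?]", text)
    # Keep only sentences with at least `min_length` whitespace-separated words
    kept = [s.strip() for s in sentences if len(s.split()) >= min_length]
    return " ".join(kept)


sample = (
    "Well, Prince. So Genoa and Lucca are now just family estates!\n"
    "No. But I warn you, this means war?"
)
cleaned = preprocess_text(sample)
print(cleaned)
```

The two short fragments ("Well, Prince." and "No.") fall below the five-word minimum and are dropped. Note that text without a closing '.', '!', or '?' never matches the pattern, so a trailing unterminated fragment is discarded as well.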

--------------------------------

### Distill a Custom Model2Vec Model

Source: https://github.com/minishlab/model2vec/blob/main/README.md

Distill a Sentence Transformer model into a Model2Vec model. This process can be done on a CPU in approximately 30 seconds. The distilled model can then be saved.

```python
from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5")

# Save the model
m2v_model.save_pretrained("m2v_model")
```

--------------------------------

### Convert Classifier to Scikit-learn Pipeline

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Use this to convert a trained classifier into a scikit-learn compatible pipeline object for easier integration and persistence.

```python
pipeline = classifier.to_pipeline()
```

--------------------------------

### Create Vocabulary from Texts

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

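Defines a function that builds a vocabulary from a list of texts: each lowercased text is split with the `Whitespace` pre-tokenizer, the tokens are counted with a `Counter`, and the `size` most frequent tokens are returned in descending order of frequency. The compiled `my_regex` pattern is defined but not used by the function.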

```python
my_regex = regex.compile(r"\w+|[^\w\s]+")

def create_vocab(texts: list[str], tokenizer: Whitespace, size: int = 30_000) -> list[str]:
    """
    Create a vocab from a list of texts.
    
    :param texts: A list of texts.
    :param tokenizer: A whitespace tokenizer.
    :param size: The size of the vocab.
    :return: A vocab sorted by frequency.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenizer.pre_tokenize_str(text.lower())
        tokens = [token for token, _ in tokens]
        counts.update(tokens)
    vocab = [word for word, _ in counts.most_common(size)]
    return vocab
```
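
The counting step can be tried without the `tokenizers` dependency. The sketch below is a standard-library stand-in, not the tutorial's function: it splits text with the same pattern as `my_regex` above, which is assumed here to approximate what the `Whitespace` pre-tokenizer does. The name `create_vocab_stdlib` and the recipe titles are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as `my_regex` above: runs of word characters, or runs of punctuation
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def create_vocab_stdlib(texts: list[str], size: int = 30_000) -> list[str]:
    """Count lowercased tokens and return the `size` most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return [word for word, _ in counts.most_common(size)]


titles = [
    "Cheeseburger Soup",
    "Bacon Cheeseburger",
    "Bacon Cheeseburger Pie",
    "Bacon-Wrapped Cheeseburger",
]
vocab = create_vocab_stdlib(titles, size=2)
print(vocab)
```

Because `most_common` orders tokens by count, limiting `size` keeps the frequent domain terms and discards rare ones, which is the trade-off the `size` parameter controls.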

--------------------------------

### Export Model to Scikit-learn Pipeline

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Converts the trained model to a scikit-learn pipeline. This allows for consistent prediction and evaluation using standard scikit-learn tools.

```python
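from sklearn.metrics import classification_report

# Convert the trained classifier into a scikit-learn pipeline
pipeline = model.to_pipeline()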

predictions = pipeline.predict(dataset["test"]["text"])

print(classification_report(dataset["test"]["label_text"], predictions))
```

--------------------------------

### Save Model Locally

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Save the trained pipeline to a local directory. This is useful for later use or deployment.

```python
pipeline.save_pretrained("my_cool_model")
```

--------------------------------

### Similarity Search Function

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/recipe_search.ipynb

Defines a function to find the most similar recipe titles to a given query using a Model2Vec model and precomputed embeddings. It calculates cosine similarity and returns the top K results.

```python
# Define a function to find the most similar titles in a dataset to a given query
def find_most_similar_items(model: StaticModel, embeddings: np.ndarray, query: str, top_k=5) -> list[tuple[int, float]]:
    """
    Finds the most similar items in a dataset to the given query using the specified model.

    :param model: The model used to generate embeddings.
    :param embeddings: The embeddings of the dataset.
    :param query: The query recipe title.
    :param top_k: The number of most similar titles to return.
    :return: A list of tuples containing the indices of the most similar titles and their cosine similarity scores.
    """
    # Generate embedding for the query
    query_embedding = model.encode(query)[None, :]

    # Calculate pairwise cosine distances between the query and the precomputed embeddings
    distances = pairwise_distances(query_embedding, embeddings, metric='cosine')[0]

    # Get the indices of the most similar items (sorted in ascending order because smaller distances are better)
    most_similar_indices = np.argsort(distances)

    # Convert distances to similarity scores (cosine similarity = 1 - cosine distance)
    most_similar_scores = [1 - distances[i] for i in most_similar_indices[:top_k]]

    # Return the top-k most similar indices and similarity scores
    return list(zip(most_similar_indices[:top_k], most_similar_scores))
```
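
The ranking logic can be checked on toy vectors without loading a model. This is a dependency-free sketch of cosine-similarity ranking in general, not the tutorial's implementation (which relies on scikit-learn's `pairwise_distances`). The names `cosine_similarity` and `top_k_similar` and the two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query: list[float], vectors: list[list[float]], top_k: int = 2) -> list[tuple[int, float]]:
    """Return (index, similarity) pairs for the `top_k` vectors closest to `query`."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Higher similarity is better, so sort in descending order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_similar([1.0, 0.0], toy_embeddings, top_k=2)
for idx, score in results:
    print(idx, round(score, 4))
```

Sorting similarities in descending order gives the same ranking as sorting cosine distances in ascending order, since cosine distance is defined as one minus cosine similarity.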

--------------------------------

### Evaluate Multi-label Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Evaluate a multi-label classifier using specified text, labels, and a classification threshold.

```python
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer

classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["labels"], threshold=0.3)
print(classification_report)
# Accuracy: 0.410
# Precision: 0.527
# Recall: 0.410
# F1: 0.439
```
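
In multi-label classification a threshold generally decides which classes count as predicted: every class whose score clears the threshold is kept. The sketch below illustrates that idea on made-up probabilities. It is not model2vec's internal code, and whether the library compares with `>=` or `>` is an assumption here; `labels_above_threshold` is a hypothetical helper.

```python
def labels_above_threshold(probabilities: list[list[float]], threshold: float) -> list[list[int]]:
    """Keep, per example, the indices of all classes scoring at least `threshold`."""
    return [
        [index for index, p in enumerate(row) if p >= threshold]
        for row in probabilities
    ]


# Made-up per-class probabilities for two examples and three classes
toy_probs = [[0.05, 0.45, 0.35], [0.10, 0.20, 0.90]]

low = labels_above_threshold(toy_probs, threshold=0.3)
high = labels_above_threshold(toy_probs, threshold=0.5)
print(low)
print(high)
```

A lower threshold predicts more labels per example, and a higher threshold can leave an example with no labels at all, so the value is worth tuning on held-out data.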

--------------------------------

### Print Classification Report

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Generates a classification report to evaluate the model's performance on the test dataset. This includes precision, recall, F1-score, and support for each class.

```python
from sklearn.metrics import classification_report

predictions = model.predict(dataset["test"]["text"])

print(classification_report(dataset["test"]["label_text"], predictions))
```
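
For readers unfamiliar with the report's columns, the sketch below computes per-class precision, recall, F1, and support by hand on made-up labels. It is a simplified illustration of the standard definitions, not scikit-learn's implementation; `per_class_metrics` is a hypothetical helper.

```python
from collections import Counter


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Compute precision, recall, F1 and support for every class in `y_true`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    report = {}
    for label in sorted(set(y_true)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    return report


# Made-up labels for illustration
y_true = ["autos", "autos", "space", "space", "space"]
y_pred = ["autos", "space", "space", "space", "autos"]
metrics = per_class_metrics(y_true, y_pred)
for label, values in metrics.items():
    print(label, values)
```

Support is simply the number of true examples of each class, which helps judge how much weight to give a class's scores.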

--------------------------------

### Validation Progress Indicator

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Displays the progress bar for the validation phase, which is shown while the model is evaluated on the validation set during training.

```text
Result:
Validation: |                                                                                  | 0/? [00:00<?,…
```

--------------------------------

### Predict with Trained Classifier

Source: https://github.com/minishlab/model2vec/blob/main/model2vec/train/README.md

Measure the prediction time for a trained classifier on a test set.

```python
from time import perf_counter

s = perf_counter()
classifier.predict(test["text"])
print(f"Took {int((perf_counter() - s) * 1000)} milliseconds for {len(test)} instances on CPU.")
# Took 67 milliseconds for 2000 instances on CPU.
```

--------------------------------

### Train TF-IDF Pipeline with Logistic Regression

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/train_classifier.ipynb

Trains a scikit-learn pipeline that uses TfidfVectorizer and LogisticRegression. This is used for comparison against the main model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

sklearn_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
sklearn_pipeline.fit(subset["text"], subset["label_text"])
predictions = sklearn_pipeline.predict(dataset["test"]["text"])

print(classification_report(dataset["test"]["label_text"], predictions))
```

--------------------------------

### Query Vicinity Index

Source: https://github.com/minishlab/model2vec/blob/main/tutorials/semantic_chunking.ipynb

Query the Vicinity index with natural language queries to retrieve semantically similar text chunks. Requires a pre-trained model for encoding queries and a Vicinity index.

```python
queries = ["Emperor Napoleon", "The battle of Austerlitz", "Paris"]
for query in queries:
    print(f"Query: {query}\n{'-' * 50}")
    query_embedding = model.encode(query)
    results = vicinity.query(query_embedding, k=3)[0]

    for result in results:
        print(result[0], "\n")
```
