### Install Tokenlearn Package

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Install the tokenlearn package with pip. This is required before using the featurize or train commands.

```bash
pip install tokenlearn
```
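To confirm the install succeeded before moving on, the standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is illustrative and not part of tokenlearn.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report whether tokenlearn is visible to the current interpreter.
found = installed_version("tokenlearn")
print(f"tokenlearn version: {found}" if found else "tokenlearn is not installed")
```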

--------------------------------

### Install Evaluation Dependencies

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Install the optional evaluation package directly from the MinishLab/evaluation Git repository. It is needed only for running model evaluations.

```bash
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
```

--------------------------------

### Featurize Dataset with Tokenlearn CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Use the `tokenlearn.featurize` CLI to create a featurized dataset from a Hugging Face dataset. Specify the sentence transformer model, the dataset path, name, and split, and the output directory.

```bash
python -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train" \
    --output-dir "data/c4_features"
```
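When featurization is launched from a script rather than a shell, the same invocation can be assembled as an argument list. The sketch below only builds and prints the command. The helper name `build_featurize_command` is hypothetical and not part of tokenlearn; the flags are the ones shown above.

```python
import shlex
import sys
from typing import List


def build_featurize_command(
    model_name: str,
    dataset_path: str,
    dataset_name: str,
    dataset_split: str,
    output_dir: str,
) -> List[str]:
    """Build the argv list for `python -m tokenlearn.featurize` (hypothetical helper)."""
    return [
        sys.executable, "-m", "tokenlearn.featurize",
        "--model-name", model_name,
        "--dataset-path", dataset_path,
        "--dataset-name", dataset_name,
        "--dataset-split", dataset_split,
        "--output-dir", output_dir,
    ]


command = build_featurize_command(
    model_name="baai/bge-base-en-v1.5",
    dataset_path="allenai/c4",
    dataset_name="en",
    dataset_split="train",
    output_dir="data/c4_features",
)
# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would execute it, assuming tokenlearn is installed in the same interpreter.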

--------------------------------

### Create Dataset with Tokenlearn Featurize CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

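Create a featurized dataset with the `tokenlearn.featurize` CLI, as documented in the dataset card template. The `{{ ... }}` placeholders are template variables for the embedding model and the source dataset details; replace `<output-dir>` with the output directory of your choice.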

```bash
python -m tokenlearn.featurize \
    --model-name "{{ model_name }}" \
    --dataset-path "{{ source_dataset }}" \
    --dataset-name "{{ source_name }}" \
    --dataset-split "{{ source_split }}" \
    --output-dir "<output-dir>"
```

--------------------------------

### Featurize Dataset and Push to Hub

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Optionally push the featurized dataset to the Hugging Face Hub after creation. Pass `--push-to-hub` with the target Hub repository ID, alongside the model name and output directory.

```bash
python -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --push-to-hub "username/my-featurized-dataset"
```

--------------------------------

### Load Dataset with Datasets Library

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

Load the featurized dataset with the Hugging Face `datasets` library. The `{{ repo_id or dataset_name }}` placeholder stands for the Hub repository ID or dataset name that the dataset card template fills in.

```python
from datasets import load_dataset

dataset = load_dataset("{{ repo_id or dataset_name }}")
```
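The `{{ repo_id or dataset_name }}` placeholder reads like a Jinja-style template expression, in which case it would resolve to `repo_id` when that is set and fall back to `dataset_name` otherwise. The sketch below mimics that fallback in plain Python. The function `resolve_dataset_id` and the example values are hypothetical, and how tokenlearn actually renders the card is not shown in the source.

```python
from typing import Optional


def resolve_dataset_id(repo_id: Optional[str], dataset_name: str) -> str:
    """Mimic `repo_id or dataset_name`: prefer the Hub repo ID, else the dataset name."""
    return repo_id or dataset_name


# With a Hub repo ID set, that ID is used.
print(resolve_dataset_id("username/my-featurized-dataset", "c4_features"))
# Without one, the dataset name is the fallback.
print(resolve_dataset_id(None, "c4_features"))
```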

--------------------------------

### Train Model2Vec with Tokenlearn CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Train a Model2Vec model on a featurized dataset with the `tokenlearn.train` CLI. Specify the sentence transformer model, the path to the featurized data, and the path where the trained model is saved.

```bash
python -m tokenlearn.train \
    --model-name "baai/bge-base-en-v1.5" \
    --data-path "data/c4_features" \
    --save-path "<path-to-save-model>"
```
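The featurize and train steps are linked by a single directory: the `--output-dir` written by featurize is the `--data-path` read by train. The sketch below makes that link explicit by building both commands from shared values without running them. The helper `module_command` and the save path `models/tokenlearn_model` are hypothetical; the flags and the other values come from the examples in this document.

```python
import shlex
import sys
from typing import Dict, List

MODEL_NAME = "baai/bge-base-en-v1.5"
FEATURES_DIR = "data/c4_features"       # written by featurize, read by train
SAVE_PATH = "models/tokenlearn_model"   # hypothetical save location


def module_command(module: str, options: Dict[str, str]) -> List[str]:
    """Build `python -m <module> --flag value ...` from an options mapping."""
    argv = [sys.executable, "-m", module]
    for flag, value in options.items():
        argv += [f"--{flag}", value]
    return argv


featurize_cmd = module_command("tokenlearn.featurize", {
    "model-name": MODEL_NAME,
    "dataset-path": "allenai/c4",
    "dataset-name": "en",
    "dataset-split": "train",
    "output-dir": FEATURES_DIR,
})
train_cmd = module_command("tokenlearn.train", {
    "model-name": MODEL_NAME,
    "data-path": FEATURES_DIR,
    "save-path": SAVE_PATH,
})

# Print shell-quoted commands in the order they would be run.
for cmd in (featurize_cmd, train_cmd):
    print(shlex.join(cmd))
```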

--------------------------------

### Train Model2Vec with HuggingFace Hub Data

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Train a Model2Vec model using data directly from the HuggingFace Hub. The `--data-path` argument accepts a Hub repository ID.

```bash
python -m tokenlearn.train \
    --model-name "baai/bge-base-en-v1.5" \
    --data-path "username/my-featurized-dataset" \
    --save-path "<path-to-save-model>"
```

--------------------------------

### Train Model2Vec with Tokenlearn

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

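Train a Model2Vec model with the `tokenlearn.train` CLI, as documented in the dataset card template. The `{{ ... }}` placeholders are template variables for the embedding model name and the data path; replace `<path-to-save-model>` with the path where the trained model should be saved.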

```bash
python -m tokenlearn.train \
    --model-name "{{ model_name }}" \
    --data-path "{{ repo_id or dataset_name }}" \
    --save-path "<path-to-save-model>"
```

--------------------------------

### Evaluate Trained Model using CustomMTEB

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Evaluate a trained Tokenlearn model with the `CustomMTEB` class from the evaluation package. The script loads the tasks, runs the evaluation, and prints a leaderboard.

```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse and print results
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)
print(make_leaderboard(task_scores))
```
