### Install Tokenlearn Package

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Install the tokenlearn package using pip. This is the first step before using the featurize or train functionalities.

```bash
pip install tokenlearn
```

--------------------------------

### Install Evaluation Dependencies

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Install the optional evaluation dependencies from a Git repository. This is required before running model evaluations.

```bash
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
```

--------------------------------

### Featurize Dataset with Tokenlearn CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Use the tokenlearn.featurize CLI to create featurized datasets from HuggingFace datasets. Specify the sentence transformer model, dataset path, name, split, and output directory.

```bash
python -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train" \
    --output-dir "data/c4_features"
```

--------------------------------

### Create Dataset with Tokenlearn Featurize CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

Recreate the featurized dataset with the `tokenlearn.featurize` CLI, as documented in the dataset card template. The `{{ ... }}` placeholders stand for the embedding model and the source dataset path, name, and split; an output directory must also be supplied.

```bash
python -m tokenlearn.featurize \
    --model-name "{{ model_name }}" \
    --dataset-path "{{ source_dataset }}" \
    --dataset-name "{{ source_name }}" \
    --dataset-split "{{ source_split }}" \
    --output-dir ""
```

--------------------------------

### Featurize Dataset and Push to Hub

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Optionally push the featurized dataset to the HuggingFace Hub after creation. Requires specifying the HuggingFace model name, output directory, and the desired Hub repository ID.
```bash
python -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --push-to-hub "username/my-featurized-dataset"
```

--------------------------------

### Load Dataset with Datasets Library

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

Load a dataset using the Hugging Face `datasets` library. Ensure the dataset name matches the repository ID or dataset name specified.

```python
from datasets import load_dataset

dataset = load_dataset("{{ repo_id or dataset_name }}")
```

--------------------------------

### Train Model2Vec with Tokenlearn CLI

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Train a Model2Vec model using the tokenlearn.train CLI on a featurized dataset. Requires specifying the sentence transformer model, the path to the featurized data, and where to save the trained model.

```bash
python -m tokenlearn.train \
    --model-name "baai/bge-base-en-v1.5" \
    --data-path "data/c4_features" \
    --save-path ""
```

--------------------------------

### Train Model2Vec with HuggingFace Hub Data

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Train a Model2Vec model using data directly from the HuggingFace Hub. The `--data-path` argument accepts a Hub repository ID.

```bash
python -m tokenlearn.train \
    --model-name "baai/bge-base-en-v1.5" \
    --data-path "username/my-featurized-dataset" \
    --save-path ""
```

--------------------------------

### Train Model2Vec with Tokenlearn

Source: https://github.com/minishlab/tokenlearn/blob/main/tokenlearn/datacards/dataset_card_template.md

Train a Model2Vec model using the Tokenlearn CLI. Specify the embedding model name, data path, and save path for the trained model.
```bash
python -m tokenlearn.train \
    --model-name "{{ model_name }}" \
    --data-path "{{ repo_id or dataset_name }}" \
    --save-path ""
```

--------------------------------

### Evaluate Trained Model using CustomMTEB

Source: https://github.com/minishlab/tokenlearn/blob/main/README.md

Python code to evaluate a trained Tokenlearn model using the CustomMTEB class from the evaluation library. It loads tasks, runs the evaluation, and prints a leaderboard.

```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse and print results
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)
print(make_leaderboard(task_scores))
```
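--------------------------------

### Chain Featurize and Train from Python (Sketch)

The featurize and train commands shown earlier are run one after the other, so it can help to assemble both from a single script. The following is a minimal sketch, not part of tokenlearn: `featurize_command` and `train_command` are hypothetical helpers, the save path `models/tokenlearn_model` is a placeholder chosen for illustration, and only CLI flags that appear in the snippets above are used. It prints the commands as a dry run rather than executing them.

```python
import shlex
import sys


def featurize_command(model_name, output_dir, dataset_path=None,
                      dataset_name=None, dataset_split=None, push_to_hub=None):
    """Build the argv list for `python -m tokenlearn.featurize` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "tokenlearn.featurize", "--model-name", model_name]
    # Only flags that appear in the snippets above; unset options are skipped.
    for flag, value in (
        ("--dataset-path", dataset_path),
        ("--dataset-name", dataset_name),
        ("--dataset-split", dataset_split),
        ("--output-dir", output_dir),
        ("--push-to-hub", push_to_hub),
    ):
        if value is not None:
            cmd += [flag, value]
    return cmd


def train_command(model_name, data_path, save_path):
    """Build the argv list for `python -m tokenlearn.train` (hypothetical helper)."""
    return [
        sys.executable, "-m", "tokenlearn.train",
        "--model-name", model_name,
        "--data-path", data_path,
        "--save-path", save_path,
    ]


if __name__ == "__main__":
    steps = [
        featurize_command(
            "baai/bge-base-en-v1.5",
            "data/c4_features",
            dataset_path="allenai/c4",
            dataset_name="en",
            dataset_split="train",
        ),
        # The save path below is an assumed placeholder, not a value from the README.
        train_command("baai/bge-base-en-v1.5", "data/c4_features", "models/tokenlearn_model"),
    ]
    for step in steps:
        # Dry run: print each command in shell-quoted form.
        print(shlex.join(step))
```

To execute the steps instead of printing them, each list can be passed to `subprocess.run(step, check=True)`, which stops the pipeline if featurization fails before training starts.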