### Hybrid Retrieval Setup (Dense and Sparse)

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/MKQA/README.md

Set up hybrid retrieval, which combines dense and sparse methods. The prerequisites (Java, Pyserini, and Faiss) and the dense retrieval steps are the same as in the dense retrieval section.

```bash
# install java (Linux)
apt update
apt install openjdk-11-jdk

# install pyserini
pip install pyserini

# install faiss
## CPU version
conda install -c conda-forge faiss-cpu

## GPU version
conda install -c conda-forge faiss-gpu
```

```bash
cd dense_retrieval

# 1. Generate Corpus Embedding
python step0-generate_embedding.py \
--encoder BAAI/bge-m3 \
--index_save_dir ./corpus-index \
--max_passage_length 512 \
--batch_size 256 \
--fp16 \
--add_instruction False \
--pooling_method cls \
--normalize_embeddings True

# 2. Search Results
python step1-search_results.py \
--encoder BAAI/bge-m3 \
--languages ar da de es fi fr he hu it ja km ko ms nl no pl pt ru sv th tr vi zh_cn zh_hk zh_tw \
--index_save_dir ./corpus-index \
--result_save_dir ./search_results \
--qa_data_dir ../qa_data \
--threads 16 \
--batch_size 32 \
--hits 1000 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False

# 3. Print and Save Evaluation Results
python step2-eval_dense_mkqa.py \
--encoder BAAI/bge-m3 \
--languages ar da de es fi fr he hu it ja km ko ms nl no pl pt ru sv th tr vi zh_cn zh_hk zh_tw \
--search_result_save_dir ./search_results \
--qa_data_dir ../qa_data \
--eval_result_save_dir ./eval_results \
--metrics recall@20 recall@100 \
--threads 32 \
--pooling_method cls \
--normalize_embeddings True
```
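The `--normalize_embeddings True` flag is worth a short illustration: once embeddings are scaled to unit length, ranking by inner product gives the same result as ranking by cosine similarity. The sketch below checks that identity in plain Python. The vectors are made-up values, not model output, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (what embedding normalization does)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up 3-dimensional "embeddings" for illustration only
query = [1.0, 2.0, 2.0]
passage = [2.0, 0.0, 1.0]

ip_after_norm = dot(l2_normalize(query), l2_normalize(passage))
print(abs(ip_after_norm - cosine(query, passage)) < 1e-12)
```

Because the two quantities agree, the same normalization setting should be used at every step that encodes queries or passages, as the commands above do.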

--------------------------------

### Install C-MTEB using pip

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/C-MTEB.md

Install the C-MTEB package using pip for quick setup.

```bash
pip install -U C_MTEB
```

--------------------------------

### Install C-MTEB from source

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/C-MTEB.md

Install C-MTEB by cloning the FlagEmbedding repository and installing from the local source. This is useful for development or if you need the latest changes.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/C_MTEB
pip install -e .
```

--------------------------------

### Install FlagEmbedding from source

Source: https://github.com/flagopen/flagembedding/blob/master/examples/finetune/reranker/README.md

Clone the FlagEmbedding repository and install it from source with the finetune extra. For development, use an editable install.

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install  .[finetune]
```

```shell
pip install -e .[finetune]
```

--------------------------------

### Initialize BGE English ICL Model with Examples

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.2.1_BGE_Series.ipynb

Initialize the FlagICLModel with a specified model and examples for in-context learning. Set examples_for_task to None to use the model without examples. The examples_instruction_format can be specified to define how examples are formatted.

```python
from FlagEmbedding import FlagICLModel

# `examples` is assumed to be defined earlier as a list of dicts
# with 'instruct', 'query', and 'response' keys
model = FlagICLModel(
    'BAAI/bge-en-icl',
    examples_for_task=examples,  # set `examples_for_task=None` to use model without examples
    # examples_instruction_format="<instruct>{}\n<query>{}\n<response>{}"
)
```

--------------------------------

### Install Required Libraries

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/2_Metrics/2.2_Eval_Metrics.ipynb

Installs numpy and scikit-learn, which are necessary for the subsequent code examples.

```python
%pip install numpy scikit-learn
```

--------------------------------

### Install C-MTEB from source

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/README.md

Clone the repository and install C-MTEB in editable mode for development.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/research/C_MTEB
pip install -e .
```

--------------------------------

### Install Java (Linux)

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/MKQA/README.md

Installs OpenJDK 11. Pyserini needs a Java runtime because it builds on the Java-based Anserini toolkit.

```bash
apt update
apt install openjdk-11-jdk
```

--------------------------------

### Install Sentence-Transformers

Source: https://github.com/flagopen/flagembedding/blob/master/examples/inference/embedder/README.md

Install the sentence-transformers library to use BGE models with this framework. This is a prerequisite for the following code examples.

```shell
pip install -U sentence-transformers
```

--------------------------------

### Install LM-Cocktail from Source

Source: https://github.com/flagopen/flagembedding/blob/master/research/LM_Cocktail/README.md

Clone the repository and install the package in editable mode for development.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/research/LM_Cocktail
pip install -e .
```

--------------------------------

### Install Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/4_Evaluation/4.1.1_Evaluation_MSMARCO.ipynb

Installs the necessary libraries, FlagEmbedding and faiss-cpu, for the evaluation pipeline.

```python
%pip install -U FlagEmbedding faiss-cpu
```

--------------------------------

### Setup Unsloth Environment

Source: https://github.com/flagopen/flagembedding/blob/master/research/Long_LLM/longllm_qlora/README.md

Installs necessary packages for Unsloth and Llama-3 fine-tuning, including PyTorch, transformers, and flash-attn. Ensure you are in the 'unsloth' conda environment.

```bash
conda create -n unsloth python=3.10
conda activate unsloth

conda install pytorch==2.2.2 pytorch-cuda=12.1 cudatoolkit xformers -c pytorch -c nvidia -c xformers
pip install transformers==4.39.3 deepspeed accelerate datasets==2.18.0 peft bitsandbytes
pip install flash-attn --no-build-isolation
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# these packages are used in evaluation
pip install rouge fuzzywuzzy jieba pandas seaborn python-Levenshtein
```

--------------------------------

### Format Detailed Example for LLM

Source: https://github.com/flagopen/flagembedding/blob/master/examples/inference/embedder/README.md

Constructs a formatted example string for LLM input, including task description, query, and response.

```python
def get_detailed_example(task_description: str, query: str, response: str) -> str:
    # Join the three parts with explicit newline escapes so the string literal stays on one line
    return f'<instruct>{task_description}\n<query>{query}\n<response>{response}'
```
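As a rough illustration of how such formatted examples might be assembled into a full prompt, the sketch below places the few-shot examples first and the new query last. The `build_icl_prompt` helper and the blank-line separator are assumptions made for this example; they are not FlagEmbedding's internal logic.

```python
def get_detailed_example(task_description: str, query: str, response: str) -> str:
    return f'<instruct>{task_description}\n<query>{query}\n<response>{response}'

def build_icl_prompt(examples, task_description: str, query: str) -> str:
    """Hypothetical helper: few-shot examples first, then the new query."""
    parts = [get_detailed_example(e['instruct'], e['query'], e['response']) for e in examples]
    # The final query has no response part, since that is what the model must produce
    parts.append(f'<instruct>{task_description}\n<query>{query}')
    return '\n\n'.join(parts)

task = 'Given a web search query, retrieve relevant passages that answer the query.'
examples = [{
    'instruct': task,
    'query': 'what is a virtual interface',
    'response': 'A virtual interface is a software-defined network interface.',
}]
prompt = build_icl_prompt(examples, task, 'summit define')
print(prompt)
```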

--------------------------------

### Install Project Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/docs/README.md

Run this command to install all required Python packages listed in requirements.txt.

```bash
pip install -r requirements.txt
```

--------------------------------

### Install FlagEmbedding from Source

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/Introduction/installation.md

Clone the repository and install the package from local sources. Options for including finetuning dependencies are available.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install  .
```

```bash
pip install  .[finetune]
```

--------------------------------

### Install FlagEmbedding Package

Source: https://github.com/flagopen/flagembedding/blob/master/research/BGE_M3/README.md

Use this command to clone the repository and install the package locally. Alternatively, install directly using pip.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

```bash
pip install -U FlagEmbedding
```

--------------------------------

### Install Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/7_Fine-tuning/7.1.3_Eval_FT_Model.ipynb

Install necessary libraries for dataset handling, evaluation, and embedding models.

```python
%pip install -U datasets pytrec_eval FlagEmbedding
```

--------------------------------

### Install faiss-cpu

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/3_Indexing/3.1.1_Intro_to_Faiss.ipynb

Install the CPU version of Faiss using pip. This is a simpler installation for systems without compatible GPUs.

```python
%pip install -U faiss-cpu
```

--------------------------------

### Install Pyserini

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/MKQA/README.md

Installs the Pyserini library, a Python toolkit for reproducible information retrieval research.

```bash
pip install pyserini
```

--------------------------------

### Initialize BGE English ICL Model with Examples

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/tutorial/1_Embedding/1.2.1.ipynb

Use FlagICLModel for English embedding tasks with in-context learning. You can provide few-shot examples via `examples_for_task`. Setting `examples_for_task=None` uses the model without examples. The `examples_instruction_format` can be specified to define the format of the examples.

```python
from FlagEmbedding import FlagICLModel

# `examples` is assumed to be defined earlier as a list of dicts
# with 'instruct', 'query', and 'response' keys
model = FlagICLModel(
    'BAAI/bge-en-icl',
    examples_for_task=examples,  # set `examples_for_task=None` to use model without examples
    # examples_instruction_format="<instruct>{}\n<query>{}\n<response>{}"  # specify the format to use examples_for_task
)
```

--------------------------------

### Training Data Format Example

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_embedder/docs/fine-tune.md

Example structure for training data, including query, positive/negative examples, and optional fields for distillation and scoring.

```python
# training
{
  "query": str,
  "pos": List[str],
  "neg": List[str],
  "pos_index": Optional[List[int]],         # Indices of the positives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
  "neg_index": Optional[List[int]],         # Indices of the negatives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
  "teacher_scores": Optional[List[float]],  # Scores from an LM or a reranker, used for distillation.
  "answers": Optional[List[str]],           # List of answers for the query, used for LM scoring.
}
```
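To make the schema concrete, the sketch below builds one made-up record, checks its required fields, and serializes it as a single JSON line. The `validate_record` helper, the one-object-per-line layout, and the sample text are illustrative assumptions, not part of the llm_embedder code.

```python
import json
from typing import Any, Dict

REQUIRED = {"query": str, "pos": list, "neg": list}
OPTIONAL = ("pos_index", "neg_index", "teacher_scores", "answers")

def validate_record(rec: Dict[str, Any]) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    for key, typ in REQUIRED.items():
        if not isinstance(rec.get(key), typ):
            raise ValueError(f"field {key!r} must be of type {typ.__name__}")
    for key in OPTIONAL:
        if rec.get(key) is not None and not isinstance(rec[key], list):
            raise ValueError(f"optional field {key!r} must be a list when present")

# Made-up record; the optional index fields are omitted, as allowed when no global corpus exists
record = {
    "query": "what is a virtual interface",
    "pos": ["A virtual interface is a software-defined network interface."],
    "neg": ["The giant panda is a bear species endemic to China."],
    "answers": ["a software-defined network interface"],
}
validate_record(record)
line = json.dumps(record, ensure_ascii=False)
print(line)
```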

--------------------------------

### Install FlagEmbedding and MTEB

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/4_Evaluation/4.2.3_C-MTEB.ipynb

Install the required libraries for C-MTEB evaluation.

```python
%pip install FlagEmbedding mteb
```

--------------------------------

### Install FlagEmbedding

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.2.2_Auto_Embedder.ipynb

Install the FlagEmbedding library using pip. This is the first step before using any of its functionalities.

```python
%pip install FlagEmbedding
```

--------------------------------

### Install FlagEmbedding from Source (No Finetune)

Source: https://github.com/flagopen/flagembedding/blob/master/README.md

Clone the repository and install the FlagEmbedding package from source without finetune dependencies. This method is useful for development or if pip installation is not preferred.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install  .
```

--------------------------------

### Install Required Packages

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/6_RAG/6.2_RAG_LangChain.ipynb

Install necessary Python packages for LangChain, PDF loading, and OpenAI/HuggingFace integrations. This is the first step to set up the environment.

```python
%pip install pypdf langchain langchain-openai langchain-huggingface
```

--------------------------------

### Install FlagEmbedding Package

Source: https://github.com/flagopen/flagembedding/blob/master/research/visual_bge/README.md

Clone the FlagEmbedding repository and install the visual_bge research package in editable mode.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/research/visual_bge
pip install -e .
```

--------------------------------

### Install FlagEmbedding from source

Source: https://github.com/flagopen/flagembedding/blob/master/examples/README.md

Clone the repository and install FlagEmbedding from source. Use this for development or if you need the latest unreleased features.

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install  .
```

```shell
pip install -e .
```

--------------------------------

### Install FlagEmbedding with Pip

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/Introduction/installation.md

Install the package using pip. Use the '[finetune]' option if finetuning capabilities are needed.

```bash
pip install -U FlagEmbedding
```

```bash
pip install -U FlagEmbedding[finetune]
```

--------------------------------

### Install Datasets Library

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/7_Fine-tuning/7.1.1_Data_preparation.ipynb

Installs or upgrades the Hugging Face datasets library. This is a prerequisite for loading and manipulating datasets.

```python
%pip install -U datasets
```

--------------------------------

### Install Sentence Transformers

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/4_Evaluation/4.3.1_Sentence_Transformers_Eval.ipynb

Install the Sentence Transformers library using pip. This is a prerequisite for using the library.

```python
%pip install -U sentence-transformers
```

--------------------------------

### Install FlagEmbedding from Source (Editable Mode)

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/Introduction/installation.md

Install the package in editable mode from source for development. Include '[finetune]' for finetuning support.

```bash
pip install -e .

```

```bash
pip install -e .[finetune]
```

--------------------------------

### Hybrid Retrieval Example (Milvus)

Source: https://github.com/flagopen/flagembedding/blob/master/research/BGE_M3/README.md

Demonstrates hybrid retrieval using BGE-M3 with Milvus. This example shows how to integrate dense and sparse retrieval capabilities.

```python
from pymilvus import (
    connections, CollectionSchema, FieldSchema, DataType, Collection,
    AnnSearchRequest, RRFRanker,
)

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define schema for hybrid search
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="sparse_vec", dtype=DataType.SPARSE_FLOAT_VECTOR),  # sparse vectors have no fixed dimension
    FieldSchema(name="dense_vec", dtype=DataType.FLOAT_VECTOR, dim=1024),  # BGE-M3 dense embeddings are 1024-dimensional
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048),
]
schema = CollectionSchema(fields, "BGE-M3 hybrid search example")

# Create collection
collection = Collection("bge_m3_hybrid", schema)

# Create index for sparse and dense vectors
index_params_sparse = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
collection.create_index("sparse_vec", index_params_sparse)
index_params_dense = {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 1024}}
collection.create_index("dense_vec", index_params_dense)

# Load collection for searching
collection.load()

# Example search query (replace with actual BGE-M3 output)
query_sparse = [{1: 0.1, 42: 0.5}]  # placeholder sparse query: {token_id: weight}
query_dense = [[0.2] * 1024]        # placeholder dense query vector

# Perform hybrid search: one ANN request per vector field, fused with reciprocal rank fusion
sparse_req = AnnSearchRequest(query_sparse, "sparse_vec", {"metric_type": "IP"}, limit=3)
dense_req = AnnSearchRequest(query_dense, "dense_vec", {"metric_type": "IP", "params": {"nprobe": 10}}, limit=3)
results = collection.hybrid_search([sparse_req, dense_req], rerank=RRFRanker(), limit=3)

# Process search results
for hit in results[0]:
    print(f"ID: {hit.id}, Score: {hit.distance}")

# Note: This is a simplified example that assumes pymilvus 2.4+ and a running Milvus server.
# Insert data into the collection before searching.
```

--------------------------------

### Install Transformers Library

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_embedder/README.md

Install the transformers library using pip. This is an alternative method for using the LLM-Embedder model.

```bash
pip install -U transformers
```

--------------------------------

### Install Faiss (CPU Version)

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/MKQA/README.md

Installs the CPU version of Faiss, a library for efficient similarity search and clustering of dense vectors.

```bash
conda install -c conda-forge faiss-cpu
```

--------------------------------

### Initialize and Use FlagICLModel

Source: https://github.com/flagopen/flagembedding/blob/master/examples/inference/embedder/README.md

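Demonstrates initializing FlagICLModel with a query instruction and few-shot examples for retrieval tasks, then encoding queries and passages and scoring them by inner product. Setting `use_fp16=True` speeds up computation with a slight performance degradation; `devices` selects the device(s) the model runs on.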

```python
from FlagEmbedding import FlagICLModel

examples = [
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'what is a virtual interface',
        'response': "A virtual interface is a software-defined abstraction that mimics the behavior and characteristics of a physical network interface. It allows multiple logical network connections to share the same physical network interface, enabling efficient utilization of network resources. Virtual interfaces are commonly used in virtualization technologies such as virtual machines and containers to provide network connectivity without requiring dedicated hardware. They facilitate flexible network configurations and help in isolating network traffic for security and management purposes."
    },
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'causes of back pain in female for a week',
        'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."
    }
]
model = FlagICLModel(
    'BAAI/bge-en-icl',
    query_instruction_for_retrieval="Given a question, retrieve passages that answer the question.",
    query_instruction_format="<instruct>{}\n<query>{}",
    examples_for_task=examples,
    examples_instruction_format="<instruct>{}\n<query>{}\n<response>{}",
    use_fp16=True,
    devices=['cuda:1']
) # Setting use_fp16 to True speeds up computation with a slight performance degradation
queries = [
    "how much protein should a female eat",
    "summit define"
]
passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```

--------------------------------

### Mine Negatives using BM25

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_embedder/docs/fine-tune.md

This command mines negative examples using the BM25 retrieval method. Ensure `anserini_dir` points to your Anserini installation.

```bash
# BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untar anserini.tar.gz)
torchrun --nproc_per_node 8 -m evaluation.eval_retrieval \
--anserini_dir /data/anserini \
--retrieval_method bm25 \
--eval_data llm-embedder:qa/nq/train.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics mrr recall collate_neg \
--save_name bm25 \
--data_root /data/llm-embedder
```

--------------------------------

### Download and Set Up Java and Anserini for BM25

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_embedder/docs/fine-tune.md

Download Java 11 and Anserini, then extract them. Temporarily set JAVA_HOME and add Java's bin directory to your PATH. It's recommended to add these exports to your ~/.bashrc for persistence.

```bash
# feel free to change /data to your preferred location
wget "https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true" -O /data/java11.tar.gz
wget "https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true" -O /data/anserini.tar.gz

cd /data
tar -xzvf java11.tar.gz
tar -xzvf anserini.tar.gz

# the lines below set JAVA_HOME only for the current shell; it is RECOMMENDED that you add them to ~/.bashrc
export JAVA_HOME=/data/jdk-11.0.2
export PATH=$JAVA_HOME/bin:$PATH
```

--------------------------------

### Get MTEB Tasks

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/4_Evaluation/4.2.1_MTEB_Intro.ipynb

Retrieve MTEB tasks based on a provided list of task names. This example selects only the first task from the retrieval_tasks list.

```python
tasks = mteb.get_tasks(tasks=retrieval_tasks[:1])
```

--------------------------------

### Fused Adam Extension Load Time (Second Instance)

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/7_Fine-tuning/7.1.2_Fine-tune.ipynb

Reports the time taken to load the 'fused_adam' extension module for a second time. This log entry also includes warnings about attempting to get the learning rate before the scheduler has started.

```log
Time to load fused_adam op: 1.2037739753723145 seconds
[2024-12-23 06:35:06,883] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
[2024-12-23 06:35:06,888] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
```

--------------------------------

### Host Webpages Locally

Source: https://github.com/flagopen/flagembedding/blob/master/docs/README.md

Use this command to start a simple HTTP server on your local machine, useful for testing webpages.

```bash
python -m http.server
```

--------------------------------

### Compute Combined Scores for Text Pairs with BGE-M3

Source: https://github.com/flagopen/flagembedding/blob/master/research/BGE_M3/README.md

This example demonstrates computing scores for pairs of texts using dense, sparse, and ColBERT vectors. You can specify weights for each mode to get a weighted sum score. `max_passage_length` can be adjusted to balance latency and accuracy.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) 

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs, 
                          max_passage_length=128, # a smaller max length leads to a lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4])) # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

# {
#   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142], 
#   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625], 
#   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625], 
#   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
# }
```
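The combined scores in the output above can be reproduced by hand. The sketch below assumes the combination is a weighted average, that is, the weighted sum divided by the sum of the weights used; this normalization is inferred from the printed numbers rather than taken from the library source, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(scores, weights):
    """Weighted average of per-mode scores; `scores` and `weights` are parallel sequences."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Scores of the first sentence pair, copied from the output above
dense, sparse, colbert = 0.6259765625, 0.195556640625, 0.7796499729156494
w = [0.4, 0.2, 0.4]  # weights for dense, sparse, colbert

sparse_dense = weighted_score([dense, sparse], w[:2])
all_modes = weighted_score([dense, sparse, colbert], w)
print(round(sparse_dense, 4), round(all_modes, 4))  # 0.4825 0.6014
```

Both values match the first entries of `'sparse+dense'` and `'colbert+sparse+dense'` to within floating-point rounding.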

--------------------------------

### Install Java, Pyserini, and Faiss

Source: https://github.com/flagopen/flagembedding/blob/master/research/C_MTEB/MKQA/README.md

Install Java, Pyserini, and Faiss. Choose between CPU or GPU versions of Faiss based on your system. This is a prerequisite for both dense and hybrid retrieval.

```bash
# install java (Linux)
apt update
apt install openjdk-11-jdk

# install pyserini
pip install pyserini

# install faiss
## CPU version
conda install -c conda-forge faiss-cpu

## GPU version
conda install -c conda-forge faiss-gpu
```

--------------------------------

### Install Evaluation Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/examples/evaluation/README.md

Install pytrec_eval and Faiss for model evaluation. The Faiss wheel shown here is the GPU build of v1.7.3 for Python 3.10 on x86_64 Linux; choose a wheel that matches your Python version and platform.

```shell
pip install pytrec_eval
# if you fail to install pytrec_eval, try the following command
# pip install pytrec-eval-terrier
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

--------------------------------

### Create and Use IndexPQ

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/tutorial/3_Indexing/3.1.4.ipynb

Shows how to create an IndexPQ, train it, add data, and perform a search. This index uses product quantization for memory efficiency and fast approximate nearest neighbor searches.

```python
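import faiss
import numpy as np

# Illustrative setup; the tutorial defines these values in earlier cells
d, M, nbits, k = 128, 8, 8, 3  # dimension, sub-quantizers (d % M == 0), bits per code, neighbors
data = np.random.random((10000, d)).astype('float32')

index = faiss.IndexPQ(d, M, nbits, faiss.METRIC_L2)

index.train(data)
index.add(data)

D, I = index.search(data[:1], k)

print(f"closest elements: {I}")
print(f"distance: {D}")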
```
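The memory saving from product quantization is simple arithmetic: each vector is stored as `M` codes of `nbits` bits instead of `d` 32-bit floats. The sketch below works through one case. The parameter values are illustrative, and `pq_code_size` is a hypothetical helper, not a Faiss function.

```python
def pq_code_size(M: int, nbits: int) -> int:
    """Bytes per stored vector: M codes of nbits bits each, rounded up to whole bytes."""
    return (M * nbits + 7) // 8

d, M, nbits = 128, 8, 8            # illustrative values
raw_bytes = d * 4                  # one float32 vector
code_bytes = pq_code_size(M, nbits)
print(raw_bytes, code_bytes, raw_bytes // code_bytes)  # 512 8 64
```

With these settings each vector shrinks from 512 bytes to 8 bytes, at the cost of approximate rather than exact distances.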

--------------------------------

### Initialize and Use LightWeightFlagLLMReranker

Source: https://github.com/flagopen/flagembedding/blob/master/examples/inference/reranker/README.md

Initialize LightWeightFlagLLMReranker, then pass `cutoff_layers`, `compress_ratio`, and `compress_layers` to `compute_score`. Setting `use_fp16` to True speeds up computation with a slight performance degradation.

```python
from FlagEmbedding import LightWeightFlagLLMReranker
reranker = LightWeightFlagLLMReranker(
    'BAAI/bge-reranker-v2.5-gemma2-lightweight',
    query_max_length=256,
    passage_max_length=512,
    use_fp16=True,
    devices=['cuda:1']
) # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'], cutoff_layers=[28], compress_ratio=2, compress_layers=[24, 40]) # Adjusting 'cutoff_layers' to pick which layers are used for computing the score.
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], cutoff_layers=[28], compress_ratio=2, compress_layers=[24, 40])
print(scores)
```

--------------------------------

### Install FlagEmbedding and Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.1_Intro&Inference.ipynb

Installs the FlagEmbedding library along with sentence-transformers, openai, and cohere. Ensure you have the necessary permissions to install packages.

```python
%%capture
%pip install -U FlagEmbedding sentence_transformers openai cohere
```

--------------------------------

### CodeRAG Evaluation Script Setup and Execution

Source: https://github.com/flagopen/flagembedding/blob/master/research/BGE_Coder/README.md

Clone the CodeRAG repository, prepare the environment and data, then run the evaluation script. Refer to the README for environment setup.

```shell
cd ./evaluation/coderag_eval
### clone coderag
git clone https://github.com/code-rag-bench/code-rag-bench.git
## You need to prepare the environment according to README.md
rm -rf ./code-rag-bench/retrieval/create
cp -r ./test/* ./code-rag-bench/retrieval/
### prepare data
bash prepare_data.sh
### evaluate
bash eval.sh
```

--------------------------------

### Install Environment Dependencies

Source: https://github.com/flagopen/flagembedding/blob/master/research/Reinforced_IR/README.md

Installs necessary packages for PyTorch, the usage environment, training, and evaluation. Ensure you activate the correct conda environment before installation.

```bash
conda create -n reinforced_ir python=3.10
conda activate reinforced_ir

# prepare torch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# prepare usage environment
pip install -r requirements.txt
pip install transformers==4.46.0

# prepare training environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

# prepare evaluation environment
pip install pytrec_eval
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

--------------------------------

### Install FlagEmbedding from Source (With Finetune)

Source: https://github.com/flagopen/flagembedding/blob/master/README.md

Clone the repository and install the FlagEmbedding package from source with finetune dependencies. This is necessary for users who need to modify or finetune the models.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .[finetune]
```

--------------------------------

### Install Additional Core Packages

Source: https://github.com/flagopen/flagembedding/blob/master/research/visual_bge/README.md

Install essential Python packages for the project. Avoid installing xformers and apex: they are not needed for inference and may cause conflicts.

```bash
pip install torchvision timm einops ftfy
```

--------------------------------

### Initialize Ground Truth and Results Data

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/2_Metrics/2.2_Eval_Metrics.ipynb

Sets up example ground truth document IDs and search results for queries, used to demonstrate metric calculations.

```python
import numpy as np

ground_truth = [
    [11,  1,  7, 17, 21],
    [ 4, 16,  1],
    [26, 10, 22,  8],
]

results = [
    [11,  1, 17,  7, 21,  8,  0, 28,  9, 20],
    [16,  1,  6, 18,  3,  4, 25, 19,  8, 14],
    [24, 10, 26,  2,  8, 28,  4, 23, 13, 21],
]
```
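As a preview of how this data is used, the sketch below computes recall@k and mean reciprocal rank over the three queries in plain Python. It is a minimal illustration under the usual definitions of these metrics, not the tutorial's own implementation, and the function names are hypothetical.

```python
ground_truth = [
    [11, 1, 7, 17, 21],
    [4, 16, 1],
    [26, 10, 22, 8],
]
results = [
    [11, 1, 17, 7, 21, 8, 0, 28, 9, 20],
    [16, 1, 6, 18, 3, 4, 25, 19, 8, 14],
    [24, 10, 26, 2, 8, 28, 4, 23, 13, 21],
]

def recall_at_k(results, ground_truth, k):
    """Mean fraction of each query's relevant documents found in its top-k results."""
    per_query = [
        len(set(res[:k]) & set(gt)) / len(gt)
        for res, gt in zip(results, ground_truth)
    ]
    return sum(per_query) / len(per_query)

def mrr(results, ground_truth):
    """Mean reciprocal rank of the first relevant document (0 if none is retrieved)."""
    total = 0.0
    for res, gt in zip(results, ground_truth):
        relevant = set(gt)
        total += next((1.0 / rank for rank, doc in enumerate(res, start=1) if doc in relevant), 0.0)
    return total / len(results)

print(round(recall_at_k(results, ground_truth, 5), 4))   # 0.8056
print(round(recall_at_k(results, ground_truth, 10), 4))  # 0.9167
print(round(mrr(results, ground_truth), 4))              # 0.8333
```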

--------------------------------

### Fine-tune Model with Provided Script

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_dense_retriever/README.md

This command-line example shows how to fine-tune the model using the `run.py` script. It includes various parameters for model configuration, data loading, LoRA settings, and distributed training. Adjust parameters based on your specific fine-tuning needs.

```shell
cd ./finetune
torchrun --nproc_per_node 8 \
run.py \
--output_dir ./test \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--train_data cfli/bge-e5data \
--learning_rate 1e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--lora_alpha 64 \
--lora_rank 32 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 2048 \
--passage_max_len 512 \
--example_query_max_len 256 \
--example_passage_max_len 256 \
--train_group_size 8 \
--logging_steps 1 \
--save_steps 250 \
--save_total_limit 20 \
--ddp_find_unused_parameters False \
--negatives_cross_device \
--gradient_checkpointing \
--deepspeed ../../LLARA/stage1.json \
--warmup_steps 100 \
--fp16 \
--cache_dir ./cache/model_cache \
--token ... \
--cache_path ./cache/data_cache \
--sub_batch_size 64 \
--target_modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj \
--use_special_tokens \
--symmetric_batch_size 256 \
--symmetric_train_group_size 8 \
--max_class_neg 7 \
--save_merged_lora_model True
```

--------------------------------

### Initialize FlagAutoModel for Embedding Inference

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.2.2_Auto_Embedder.ipynb

Import FlagAutoModel and initialize it using the from_finetuned() function with a model name and optional parameters like query instructions and devices. Specify 'cuda:0' for GPU acceleration or omit for automatic device selection.

```python
from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    devices="cuda:0",   # if not specified, will use all available gpus or cpu when no gpu available
)
```
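
To make the role of `query_instruction_for_retrieval` concrete, here is a rough sketch of the idea: the instruction is prepended to each query before encoding, while passages are encoded unchanged. The helper `with_instruction` and its `instruction_format` parameter are illustrative assumptions, not the library's internal code.

```python
def with_instruction(queries, instruction, instruction_format="{}{}"):
    """Prepend the retrieval instruction to every query (hypothetical helper)."""
    return [instruction_format.format(instruction, q) for q in queries]


instruction = "Represent this sentence for searching relevant passages: "
queries = ["What is a vector index?", "How do rerankers work?"]

# Only queries receive the instruction; passages would be passed through as-is.
for text in with_instruction(queries, instruction):
    print(text)
```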

--------------------------------

### Install FlagEmbedding and Transformers

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.2.3_BGE_v1&1.5.ipynb

Install the necessary packages for using FlagEmbedding and transformers.

```python
%%capture
%pip install -U transformers FlagEmbedding
```

--------------------------------

### Set up Evaluation Arguments

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/4_Evaluation/4.5.2_MLDR.ipynb

Prepare command-line arguments for the evaluation script. This snippet demonstrates how to format and assign arguments, simulating command-line input for the evaluation process.

```python
import sys

arguments = """- \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
```
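
The leading `-` in the argument string stands in for the script name that normally occupies `sys.argv[0]`. The sketch below shows how such a split list could be consumed, using the standard-library `argparse` with three of the flags. The evaluation script's real parser is not shown in this section, so the parser definition here is an assumption for illustration only.

```python
import argparse

# A shortened argument string; "-" is the placeholder for the program name.
arguments = "- --eval_name mldr --k_values 10 100 --devices cuda:0 cuda:1"
argv = arguments.split()

# Hypothetical parser covering only the flags used above.
parser = argparse.ArgumentParser()
parser.add_argument("--eval_name", type=str)
parser.add_argument("--k_values", type=int, nargs="+")
parser.add_argument("--devices", type=str, nargs="+")

# Skip argv[0], exactly as a parser reading sys.argv would.
args = parser.parse_args(argv[1:])
print(args.eval_name, args.k_values, args.devices)
```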

--------------------------------

### Install faiss-gpu using Conda

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/3_Indexing/3.1.2_Faiss_GPU.ipynb

Install the faiss-gpu package using Conda. Ensure you are in a Linux x86_64 environment and select the created conda environment as the kernel for your notebook. Restart the kernel after installation.

```bash
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
```

--------------------------------

### Install LM-Cocktail via Pip

Source: https://github.com/flagopen/flagembedding/blob/master/research/LM_Cocktail/README.md

Install the latest version of LM-Cocktail using pip.

```bash
pip install -U LM_Cocktail
```

--------------------------------

### Initialize and Use BGE-Reranker Model

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/bge/bge_reranker.md

Demonstrates how to initialize the BGE-Reranker model with specified parameters and compute a relevance score for a single sentence pair. Ensure the FlagEmbedding library is installed and the specified device (`cuda:1` here) is available.

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker(
    'BAAI/bge-reranker-base',
    query_max_length=256,
    use_fp16=True,
    devices=['cuda:1'],
)

score = reranker.compute_score(['I am happy to help', 'Assisting you is my pleasure'])
print(score)
```
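
Reranker scores of this kind are raw, unbounded values. The FlagEmbedding README describes passing `normalize=True` to `compute_score` to map them into the 0 to 1 range with a sigmoid. The sketch below illustrates only that mapping in plain Python; the raw scores are made-up values, not model output.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [-5.0, 0.0, 5.0]  # made-up example values
normalized = [sigmoid(s) for s in raw_scores]

for raw, norm in zip(raw_scores, normalized):
    print(f"raw={raw:+.1f} -> normalized={norm:.4f}")
```

The mapping is monotonic, so it changes the scale of the scores but not the ranking of the documents.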

--------------------------------

### Set up FlagEmbedding Environment

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_dense_retriever/README.md

Install necessary packages for FlagEmbedding, including PyTorch with CUDA support. Adjust the CUDA version if needed. Flash-attn is installed without build isolation.

```bash
conda create -n icl python=3.10

conda activate icl

# You may need to adjust the cuda version
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.41.0 deepspeed accelerate datasets peft pandas
pip install flash-attn --no-build-isolation
```

--------------------------------

### Chatbot Prompt Setup

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/6_RAG/6.1_RAG_From_Scratch.ipynb

Defines the system prompt for the chatbot, instructing it to act as a recommendation bot, be brief, and use the provided restaurant list and user preferences to suggest two restaurants.

```python
prompt="""
You are a bot that makes recommendations for restaurants. 
Please be brief, answer in short sentences without extra information.

These are the restaurants list:
{recommended_activities}

The user's preference is: {user_input}
Provide the user with 2 recommended restaurants based on the user's preference.
"""
```
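
The two placeholders, `{recommended_activities}` and `{user_input}`, can be filled with `str.format` before the prompt is sent to the model. The following sketch uses made-up restaurant entries and a made-up user preference purely for illustration.

```python
prompt = """
You are a bot that makes recommendations for restaurants.
Please be brief, answer in short sentences without extra information.

These are the restaurants list:
{recommended_activities}

The user's preference is: {user_input}
Provide the user with 2 recommended restaurants based on the user's preference.
"""

# Made-up retrieval results and user input (assumptions for this sketch).
retrieved = ["Luigi's Trattoria (Italian)", "Sakura Sushi (Japanese)"]
user_input = "I would like some pasta tonight."

filled_prompt = prompt.format(
    recommended_activities="\n".join(retrieved),
    user_input=user_input,
)
print(filled_prompt)
```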

--------------------------------

### Install DeepSpeed and Flash Attention

Source: https://github.com/flagopen/flagembedding/blob/master/examples/README.md

Install the deepspeed and flash-attn packages required for fine-tuning. The `--no-build-isolation` flag makes pip build flash-attn against the packages already present in your environment, so install PyTorch first.

```shell
pip install deepspeed
```

```shell
pip install flash-attn --no-build-isolation
```

--------------------------------

### Install FlagEmbedding Package

Source: https://github.com/flagopen/flagembedding/blob/master/research/llm_dense_retriever/README.md

Clone the FlagEmbedding repository and install it as an editable package in your current environment.

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

--------------------------------

### Index Documents with FaissVectorStore

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/6_RAG/6.3_RAG_LlamaIndex.ipynb

Initializes a Faiss index and a FaissVectorStore, then creates a StorageContext and builds a VectorStoreIndex from the loaded documents.

```python
import faiss
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core import Settings

# init Faiss and create a vector store
# `dim` (the embedding dimension) and `documents` (used below) are defined in earlier steps
faiss_index = faiss.IndexFlatL2(dim)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# customize the storage context using our vector store
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# use the loaded documents to build the index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```
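
`faiss.IndexFlatL2` is an exact (brute-force) index that ranks stored vectors by squared Euclidean distance to the query. The pure-Python sketch below reproduces that ranking on toy data to show what the vector store computes at query time. The function `flat_l2_search` is a hypothetical helper, not part of Faiss or LlamaIndex.

```python
def flat_l2_search(stored, query, k):
    """Return (distances, ids) of the k stored vectors closest to the query,
    using squared L2 distance (hypothetical helper for illustration)."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(stored)
    ]
    scored.sort()
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# Toy 2-dimensional "embeddings"
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
D, I = flat_l2_search(stored, [0.9, 0.1], k=2)

print(f"closest ids: {I}")
print(f"squared distances: {D}")
```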

--------------------------------

### Install Faiss CPU

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/3_Indexing/3.1.5_Faiss_Index_Choosing.ipynb

Installs the CPU version of Faiss and supporting libraries using pip. The command is commented out in the snippet; uncomment it to run the installation.

```python
# %pip install -U faiss-cpu numpy h5py
```

--------------------------------

### Create and Use IndexScalarQuantizer

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/tutorial/3_Indexing/3.1.4.ipynb

Shows how to create an IndexScalarQuantizer, train it, add data, and perform a search. This index uses scalar quantization for efficient storage and retrieval.

```python
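import faiss
import numpy as np

# placeholder data (an assumption for illustration): 1000 random float32 vectors of dimension 128
data = np.random.random((1000, 128)).astype('float32')

d = 128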
k = 3
qtype = faiss.ScalarQuantizer.QT_8bit

index = faiss.IndexScalarQuantizer(d, qtype, faiss.METRIC_L2)

index.train(data)
index.add(data)

D, I = index.search(data[:1], k)

print(f"closest elements: {I}")
print(f"distance: {D}")
```
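
To give a feel for what 8-bit scalar quantization does, the sketch below stores each vector component as one byte instead of a 4-byte float, using per-dimension minimum and maximum values learned from the data. This is a simplified pure-Python illustration of the general technique, not Faiss's actual implementation; the helper names `train`, `encode`, and `decode` are assumptions.

```python
def train(vectors):
    """Learn the per-dimension value range from the training vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def encode(vec, lo, hi):
    """Quantize each component to an integer code in [0, 255] (one byte)."""
    codes = []
    for x, l, h in zip(vec, lo, hi):
        span = (h - l) or 1.0  # avoid division by zero for constant dimensions
        code = round((x - l) / span * 255)
        codes.append(min(255, max(0, code)))
    return bytes(codes)


def decode(codes, lo, hi):
    """Reconstruct approximate float values from the byte codes."""
    return [l + c / 255 * (h - l) for c, l, h in zip(codes, lo, hi)]


vectors = [[0.0, 10.0, -1.0], [1.0, 20.0, 1.0], [0.3, 17.0, 0.4]]
lo, hi = train(vectors)

codes = encode(vectors[2], lo, hi)
approx = decode(codes, lo, hi)

print(f"bytes per vector: {len(codes)}")
print(f"original:      {vectors[2]}")
print(f"reconstructed: {approx}")
```

The reconstruction error in each dimension is at most half a quantization step, which is the trade-off for the smaller storage footprint.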

--------------------------------

### Install FlagEmbedding Package

Source: https://github.com/flagopen/flagembedding/blob/master/Tutorials/1_Embedding/1.2.7_BGE_Code_v1.ipynb

Install the FlagEmbedding package using pip. This is required before using the model.

```python
%pip install -U FlagEmbedding
```

--------------------------------

### Set up evaluation arguments

Source: https://github.com/flagopen/flagembedding/blob/master/docs/source/tutorial/4_Evaluation/4.5.2.ipynb

Prepare evaluation arguments by formatting them as a string and splitting into sys.argv. This is useful for simulating command-line arguments in a script.

```python
import sys

arguments = """- \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
```

--------------------------------

### CoIR Evaluation Script Setup and Execution

Source: https://github.com/flagopen/flagembedding/blob/master/research/BGE_Coder/README.md

Clone the CoIR repository and run the evaluation script. Ensure you are in the correct directory before executing.

```shell
cd ./evaluation/coir_eval
### clone coir
mkdir test
cd ./test
git clone https://github.com/CoIR-team/coir.git
mv ./coir/coir ../
cd ..
rm -rf ./test
### evaluate
bash eval.sh
```