### ContrastiveTrainer Setup

Source: https://context7.com/illuin-tech/colpali/llms.txt

Initializes the ContrastiveTrainer for custom training loops, supporting multi-GPU setups. Requires model, processor, collator, and loss function.

```python
import torch
from colpali_engine.trainer.contrastive_trainer import ContrastiveTrainer
from colpali_engine.collators import VisualRetrieverCollator
from colpali_engine.loss.late_interaction_losses import ColbertLoss
from colpali_engine.models import ColQwen2, ColQwen2Processor
from transformers import TrainingArguments

# Setup
model = ColQwen2.from_pretrained("vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16)
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Collator and loss
collator = VisualRetrieverCollator(processor=processor)
loss_func = ColbertLoss(temperature=0.02, normalize_scores=True)
```

--------------------------------

### SLURM Cluster Training Examples

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Submits training jobs to a SLURM cluster. The first example configures a single GPU job with specific resources, while the second example requests multiple GPUs with different constraints.

```bash
sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1 -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml"
```

```bash
sbatch --nodes=1 --time=5:00:00 -A cad15443 --gres=gpu:8 --constraint=MI250 --job-name=colpali --wrap="accelerate launch --multi-gpu scripts/configs/qwen2/train_colqwen25_model.py"
```

--------------------------------

### Install ColPali Engine

Source: https://context7.com/illuin-tech/colpali/llms.txt

Install the ColPali engine from PyPI or source. Additional dependencies for training or interpretability tools can be included.
```bash
pip install colpali-engine
```

```bash
pip install git+https://github.com/illuin-tech/colpali
```

```bash
pip install "colpali-engine[train]"
```

```bash
pip install "colpali-engine[interpretability]"
```

```bash
pip install "colpali-engine[all]"
```

--------------------------------

### Local Training Example

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Launches the ColPali training script for local execution, potentially utilizing multiple GPUs. Ensure 'accelerate' is configured correctly for your environment.

```bash
accelerate launch --multi-gpu scripts/configs/qwen2/train_colqwen25_model.py
```

--------------------------------

### Install ColPali Engine with Interpretability

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Install the ColPali engine with interpretability features enabled. This is required for generating similarity maps. The quotes keep shells such as zsh from interpreting the square brackets.

```bash
pip install "colpali-engine[interpretability]"
```

--------------------------------

### Install ColPali Training Dependencies

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Install the packages required by the ColPali training script.

```bash
pip install "colpali-engine[train]"
```

--------------------------------

### Quick Start with ColQwen2

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Load the ColQwen2 model and processor, prepare image and query inputs, and generate embeddings. FlashAttention 2 is used when it is available; otherwise the default attention implementation is loaded.
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```

--------------------------------

### Install ColPali Engine

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Install a specific version of the colpali-engine package. This is useful for reproducing results from a particular release.

```bash
pip install colpali-engine==0.1.1
```

--------------------------------

### Install All ColPali Optional Dependencies for Testing

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Installs all optional dependencies for ColPali to ensure comprehensive test discovery and execution. This is necessary to avoid errors during test runs.

```bash
pip install "colpali-engine[all]"
```

--------------------------------

### Install ColPali Development Dependencies

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Installs development dependencies for ColPali, enabling proper testing and linting.
This is required for contributing to the project.

```bash
pip install "colpali-engine[dev]"
```

--------------------------------

### Install ColPali Engine

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Install the ColPali engine package from PyPI or directly from source. For ColPali model versions above v1.0, use a `colpali-engine` version above v0.2.0.

```bash
pip install colpali-engine # from PyPI
```

```bash
pip install git+https://github.com/illuin-tech/colpali # from source
```

--------------------------------

### Create a Corpus and Dataset

Source: https://context7.com/illuin-tech/colpali/llms.txt

Demonstrates how to create a Corpus from data and then initialize a ColPaliEngineDataset using this corpus.

```python
from colpali_engine.data.dataset import ColPaliEngineDataset, Corpus

corpus_data = [{"doc": f"document_{i}"} for i in range(100)]
corpus = Corpus(
    corpus_data=corpus_data,
    doc_column_name="doc",
)

# Dataset with external corpus
train_data = [
    {"query": "query 1", "pos_target": 0, "neg_target": [1, 2, 3]},
    {"query": "query 2", "pos_target": 5, "neg_target": [6, 7, 8]},
]

dataset_with_corpus = ColPaliEngineDataset(
    data=train_data,
    corpus=corpus,
    query_column_name="query",
    pos_target_column_name="pos_target",
    neg_target_column_name="neg_target",
)
```

--------------------------------

### Full Training Pipeline with ColModelTraining

Source: https://context7.com/illuin-tech/colpali/llms.txt

Sets up and runs a complete training pipeline using HuggingFace Trainer for contrastive learning. Configure datasets, loss functions, and optional LoRA for fine-tuning.
```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments

from colpali_engine.data.dataset import ColPaliEngineDataset
from colpali_engine.loss.late_interaction_losses import ColbertLoss
from colpali_engine.models import ColQwen2, ColQwen2Processor
from colpali_engine.trainer.colmodel_training import (
    ColModelTraining,
    ColModelTrainingConfig,
)

# Load model and processor
processor = ColQwen2Processor.from_pretrained(
    "vidore/colqwen2-v1.0",
    max_num_visual_tokens=768,
)
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    use_cache=False,
    attn_implementation="flash_attention_2",
)

# Prepare datasets
train_hf = load_dataset("your-dataset", split="train")
eval_hf = load_dataset("your-dataset", split="validation")

train_dataset = ColPaliEngineDataset(
    data=train_hf,
    query_column_name="query",
    pos_target_column_name="image",
)
eval_dataset = ColPaliEngineDataset(
    data=eval_hf,
    query_column_name="query",
    pos_target_column_name="image",
)

# Configure training
config = ColModelTrainingConfig(
    output_dir="./models/my-colqwen2",
    processor=processor,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    run_eval=True,
    loss_func=ColbertLoss(
        temperature=0.02,
        normalize_scores=True,
    ),
    tr_args=TrainingArguments(
        output_dir=None,  # Will use config.output_dir
        num_train_epochs=3,
        per_device_train_batch_size=32,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        learning_rate=2e-4,
        warmup_steps=100,
        logging_steps=10,
        save_steps=500,
        eval_strategy="steps",
        eval_steps=100,
    ),
    # Optional: Use LoRA for efficient fine-tuning
    peft_config=LoraConfig(
        r=32,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
        task_type="FEATURE_EXTRACTION",
        target_modules=r"(.*(model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$|.*(custom_text_proj).*$)",
    ),
)

# Train and save
trainer = ColModelTraining(config)
trainer.train()
trainer.save()
```
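The `target_modules` value in the `LoraConfig` above is a regular expression rather than a list of layer names. As a rough sketch of which layers such a pattern selects, the snippet below applies the same regex to a few module names with `re.fullmatch`. The module names are hypothetical, and using `re.fullmatch` is an assumption about how PEFT interprets a string `target_modules`; compare against the names from `model.named_modules()` before relying on it.

```python
import re

# Regex copied from the LoraConfig above.
TARGET_MODULES = r"(.*(model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$|.*(custom_text_proj).*$)"

# Hypothetical module names, chosen only to illustrate the pattern.
candidate_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "custom_text_proj",
    "visual.blocks.0.attn.qkv",
    "lm_head",
]


def select_lora_targets(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


selected = select_lora_targets(TARGET_MODULES, candidate_names)
print(selected)
```

In this sketch only the attention and MLP projections whose names contain `model`, plus the `custom_text_proj` layer, are selected; the other names are left without adapters.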
--------------------------------

### Create and Train ContrastiveTrainer

Source: https://context7.com/illuin-tech/colpali/llms.txt

Instantiate and train a ContrastiveTrainer for visual document retrieval tasks. Ensure necessary datasets, collator, loss function, and training arguments are provided.

```python
trainer = ContrastiveTrainer(
    model=model,
    train_dataset=train_dataset,  # ColPaliEngineDataset
    eval_dataset=eval_dataset,
    data_collator=collator,
    loss_func=loss_func,
    is_vision_model=True,
    compute_symetric_loss=False,  # Optional: bidirectional loss
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
)

# Train
trainer.train()
```

--------------------------------

### Generate Similarity Maps for Interpretability

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Load the ColPali model and processor, preprocess an image and query, and generate similarity maps to visualize model focus zones.

```python
import torch
from PIL import Image

from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
)
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device

model_name = "vidore/colpali-v1.3"
device = get_torch_device("auto")

# Load the model
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# Load the processor
processor = ColPaliProcessor.from_pretrained(model_name)

# Load the image and query
image = Image.open("shift_kazakhstan.jpg")
query = "Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?"

# Preprocess inputs
batch_images = processor.process_images([image]).to(device)
batch_queries = processor.process_queries([query]).to(device)
```

--------------------------------

### Fast-Plaid Index Creation and Querying

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Utilize fast-plaid for quicker matching with larger corpus sizes. Process images in batches and create a plaid index for efficient similarity scoring.

```python
# !pip install --no-deps fast-plaid fastkmeans
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

# `images`, `processor`, `model`, and `query_embeddings` are assumed to be
# defined as in the quick start example.

# Process the inputs by batches of 4
dataloader = DataLoader(
    dataset=images,
    batch_size=4,
    shuffle=False,
    collate_fn=lambda x: processor.process_images(x),
)

ds = []
for batch_doc in tqdm(dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc)
    ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

plaid_index = processor.create_plaid_index(ds)

scores = processor.get_topk_plaid(query_embeddings, plaid_index, k=10)
```

--------------------------------

### Load ColQwen2 Model and Processor

Source: https://context7.com/illuin-tech/colpali/llms.txt

Load the ColQwen2 vision retriever model and its corresponding processor. Supports optional flash attention for performance. Ensure CUDA or MPS is available for GPU acceleration.
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

# Load the model with optional flash attention
model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

# Load the processor
processor = ColQwen2Processor.from_pretrained(model_name)

# Example: get embedding dimension
print(f"Embedding dimension: {model.dim}")  # Output: 128
print(f"Patch size: {model.patch_size}")
```

--------------------------------

### VisualRetrieverCollator for Batching

Source: https://context7.com/illuin-tech/colpali/llms.txt

Prepares batches of queries and images for training vision retrieval models. Ensure the processor is loaded correctly and samples contain image data for positive targets.

```python
from colpali_engine.collators import VisualRetrieverCollator
from colpali_engine.models import ColQwen2Processor
from PIL import Image

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Create the collator
collator = VisualRetrieverCollator(
    processor=processor,
    max_length=2048,
)

# Example batch of training samples
samples = [
    {
        "query": "What is the revenue?",
        "pos_target": [Image.new("RGB", (800, 600), "white")],
        "neg_target": None,
    },
    {
        "query": "Show the organizational chart",
        "pos_target": [Image.new("RGB", (800, 600), "lightgray")],
        "neg_target": None,
    },
]

# Collate into model-ready batch
batch = collator(samples)

print("Batch keys:", list(batch.keys()))
# Output: ['query_input_ids', 'query_attention_mask', 'doc_input_ids',
#          'doc_attention_mask', 'doc_pixel_values', ...]
print(f"Query input shape: {batch['query_input_ids'].shape}")
print(f"Doc input shape: {batch['doc_input_ids'].shape}")
```

--------------------------------

### ViDoRe Benchmark Citation (arXiv)

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

BibTeX entry for the ViDoRe Benchmark V2 paper. Use this to cite the benchmark in academic contexts.

```latex
@misc{macé2025vidorebenchmarkv2raising,
      title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
      author={Quentin Macé and António Loison and Manuel Faysse},
      year={2025},
      eprint={2505.17166},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.17166},
}
```

--------------------------------

### Load and Use ColIdefics3 (SmolVLM) Model

Source: https://context7.com/illuin-tech/colpali/llms.txt

Load a ColIdefics3 (SmolVLM) model for resource-constrained environments. Process images and queries to generate embeddings and compute similarity scores using multi-vector scoring.

```python
import torch
from PIL import Image
from colpali_engine.models import ColIdefics3, ColIdefics3Processor

# ColSmol variants: 256M or 500M parameters
model_name = "vidore/colSmol-500M"

model = ColIdefics3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColIdefics3Processor.from_pretrained(model_name)

# Process inputs
images = [Image.new("RGB", (800, 600), "white")]
queries = ["Find the quarterly results"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(f"Score: {scores[0, 0].item():.4f}")
```

--------------------------------

### Load and Use BiQwen2 Bi-Encoder Model

Source: https://context7.com/illuin-tech/colpali/llms.txt

Load a BiQwen2 bi-encoder model for faster retrieval.
Process images and queries to generate single-vector embeddings and compute similarity scores using dot product.

```python
import torch
from PIL import Image
from colpali_engine.models import BiQwen2, BiQwen2Processor

# Load bi-encoder model
model = BiQwen2.from_pretrained(
    "vidore/biqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = BiQwen2Processor.from_pretrained("vidore/biqwen2-v1.0")

# Process inputs
images = [Image.new("RGB", (800, 600), "white")]
queries = ["What is the revenue?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Generate single-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)  # (batch, dim)
    query_embeddings = model(**batch_queries)  # (batch, dim)

# Simple dot product scoring
scores = processor.score_single_vector(query_embeddings, image_embeddings)
print(f"Similarity score: {scores[0, 0].item():.4f}")
```

--------------------------------

### Generate Similarity Maps for Interpretability

Source: https://context7.com/illuin-tech/colpali/llms.txt

Visualize model focus for each query token on document images using similarity maps. This involves generating embeddings, calculating similarity maps, and plotting them.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
)

model_name = "vidore/colpali-v1.3"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Load a document image
image = Image.open("document.jpg")
query = "What is the total revenue?"

# Process inputs
batch_images = processor.process_images([image]).to(model.device)
batch_queries = processor.process_queries([query]).to(model.device)

# Generate embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Get the number of patches
n_patches = processor.get_n_patches(
    image_size=image.size,
    patch_size=model.patch_size,
)

# Get image mask
image_mask = processor.get_image_mask(batch_images)

# Generate similarity maps
similarity_maps_batch = get_similarity_maps_from_embeddings(
    image_embeddings=image_embeddings,
    query_embeddings=query_embeddings,
    n_patches=n_patches,
    image_mask=image_mask,
)

# Get maps for our image (first in batch)
similarity_maps = similarity_maps_batch[0]  # (query_length, n_patches_x, n_patches_y)

# Tokenize query for labels
query_tokens = processor.tokenizer.tokenize(query)

# Plot similarity maps for each token
plots = plot_all_similarity_maps(
    image=image,
    query_tokens=query_tokens,
    similarity_maps=similarity_maps,
    figsize=(8, 8),
    show_colorbar=True,
)

# Save each plot
for idx, (fig, ax) in enumerate(plots):
    fig.savefig(f"similarity_map_{idx}.png")
```

--------------------------------

### Process Text Queries with ColQwen2

Source: https://context7.com/illuin-tech/colpali/llms.txt

Prepare text queries using the `process_queries` method from ColQwen2Processor. This method automatically adds augmentation tokens for improved retrieval performance. Embeddings are then generated from the processed queries.
```python
import torch
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Define queries
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
    "Show me the quarterly revenue chart",
]

# Process queries (adds augmentation tokens automatically)
batch_queries = processor.process_queries(queries).to(model.device)

# Generate query embeddings
with torch.no_grad():
    query_embeddings = model(**batch_queries)

print(f"Query embeddings shape: {query_embeddings.shape}")
# Output: torch.Size([3, query_seq_len, 128])
```

--------------------------------

### ColPaliEngineDataset for Training Data

Source: https://context7.com/illuin-tech/colpali/llms.txt

A PyTorch Dataset class for loading query-document pairs, optionally including hard negatives. Requires `datasets` and `colpali_engine`.
```python
from datasets import load_dataset
from colpali_engine.data.dataset import ColPaliEngineDataset, Corpus

# Load a HuggingFace dataset
hf_dataset = load_dataset("vidore/docvqa_test_subsampled", split="test")

# Create training dataset directly from HF dataset
train_dataset = ColPaliEngineDataset(
    data=hf_dataset,
    query_column_name="query",  # Column containing query text
    pos_target_column_name="image",  # Column containing document images
    neg_target_column_name=None,  # Optional: column with hard negative IDs
    num_negatives=3,  # Max negatives to sample per query
)

print(f"Dataset size: {len(train_dataset)}")

# Access a sample
sample = train_dataset[0]
print(f"Query: {sample['query']}")
print(f"Positive target type: {type(sample['pos_target'])}")
```

--------------------------------

### Process Document Images with ColQwen2

Source: https://context7.com/illuin-tech/colpali/llms.txt

Use the `process_images` method from the ColQwen2Processor to convert PIL images into model-ready tensors. These tensors are then used to generate multi-vector embeddings.
```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Create sample images (or load real document images)
images = [
    Image.new("RGB", (800, 600), color="white"),
    Image.new("RGB", (1024, 768), color="lightgray"),
]

# Process images into batched tensors
batch_images = processor.process_images(images).to(model.device)

# Generate multi-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)

print(f"Image embeddings shape: {image_embeddings.shape}")
# Output: torch.Size([2, seq_len, 128])
```

--------------------------------

### Load ColPali Model and Processor

Source: https://context7.com/illuin-tech/colpali/llms.txt

Load the original ColPali model based on PaliGemma and its processor. This model is designed for document retrieval using vision language models.

```python
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.3"

# Load model
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

# Load processor
processor = ColPaliProcessor.from_pretrained(model_name)

# Model properties
print(f"Embedding dimension: {model.dim}")  # Output: 128
print(f"Patch size: {model.patch_size}")
```

--------------------------------

### Generate Similarity Maps with ColPali

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Use this code to generate similarity maps by performing forward passes on image and query batches, then processing embeddings to visualize query-token relevance.
```python
with torch.no_grad():
    image_embeddings = model.forward(**batch_images)
    query_embeddings = model.forward(**batch_queries)

n_patches = processor.get_n_patches(image_size=image.size, patch_size=model.patch_size)
image_mask = processor.get_image_mask(batch_images)

batched_similarity_maps = get_similarity_maps_from_embeddings(
    image_embeddings=image_embeddings,
    query_embeddings=query_embeddings,
    n_patches=n_patches,
    image_mask=image_mask,
)

similarity_maps = batched_similarity_maps[0]  # (query_length, n_patches_x, n_patches_y)

query_tokens = processor.tokenizer.tokenize(query)

plots = plot_all_similarity_maps(
    image=image,
    query_tokens=query_tokens,
    similarity_maps=similarity_maps,
)
for idx, (fig, ax) in enumerate(plots):
    fig.savefig(f"similarity_map_{idx}.png")
```

--------------------------------

### ColbertLoss for Training Retrieval Models

Source: https://context7.com/illuin-tech/colpali/llms.txt

An InfoNCE-style loss function for training late interaction retrieval models using in-batch negatives. Requires `torch` and `colpali_engine`.
```python
import torch
from colpali_engine.loss.late_interaction_losses import ColbertLoss

# Initialize the loss function
loss_func = ColbertLoss(
    temperature=0.02,  # Scaling factor for logits
    normalize_scores=True,  # Normalize by query length
    use_smooth_max=False,  # Use amax instead of log-sum-exp
    pos_aware_negative_filtering=False,  # Filter false negatives
)

# Simulated batch of embeddings
batch_size = 8
query_length = 32
doc_length = 1024
dim = 128

query_embeddings = torch.randn(batch_size, query_length, dim)
doc_embeddings = torch.randn(batch_size, doc_length, dim)

# L2 normalize embeddings (as done in the model)
query_embeddings = query_embeddings / query_embeddings.norm(dim=-1, keepdim=True)
doc_embeddings = doc_embeddings / doc_embeddings.norm(dim=-1, keepdim=True)

# Compute loss (diagonal elements are positive pairs)
loss = loss_func(
    query_embeddings=query_embeddings,
    doc_embeddings=doc_embeddings,
    offset=0,  # For multi-GPU training offset
)

print(f"Training loss: {loss.item():.4f}")
```

--------------------------------

### Token Pooling with HierarchicalTokenPooler (Padded Tensor Input)

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Pool embeddings from padded 3D tensor inputs using HierarchicalTokenPooler. Set padding=True and provide the tokenizer's padding_side for correct padding removal before pooling.
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.compression.token_pooling import HierarchicalTokenPooler
from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2Processor.from_pretrained(model_name)

token_pooler = HierarchicalTokenPooler()

# Your page images
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (32, 32), color="black"),
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)

# Apply token pooling (reduces the sequence length of the multi-vector embeddings)
image_embeddings = token_pooler.pool_embeddings(
    image_embeddings,
    pool_factor=2,
    padding=True,
    padding_side=processor.tokenizer.padding_side,
)
```

--------------------------------

### ColPali Paper Citation (arXiv)

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

BibTeX entry for the ColPali paper. Use this to cite the work in academic contexts.

```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models},
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449},
}
```

--------------------------------

### Token Pooling for Embedding Compression

Source: https://context7.com/illuin-tech/colpali/llms.txt

Reduces multi-vector embedding size using hierarchical token pooling. Requires `torch`, `PIL`, and `colpali_engine`.
```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor
from colpali_engine.compression.token_pooling import HierarchicalTokenPooler

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Create the token pooler
token_pooler = HierarchicalTokenPooler()

# Process images
images = [
    Image.new("RGB", (800, 600), color="white"),
    Image.new("RGB", (1024, 768), color="lightgray"),
]
batch_images = processor.process_images(images).to(model.device)

# Generate embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)

print(f"Original shape: {image_embeddings.shape}")

# Apply token pooling with pool_factor=2 (reduces by ~50%)
pooled_embeddings = token_pooler.pool_embeddings(
    image_embeddings,
    pool_factor=2,
    padding=True,
    padding_side=processor.tokenizer.padding_side,
)

print(f"Pooled embeddings: {len(pooled_embeddings)} tensors")
for i, emb in enumerate(pooled_embeddings):
    print(f"  Document {i}: {emb.shape}")

# Example with pool_factor=3 (reduces by ~66.7%, retains ~97.8% performance)
pooled_3x = token_pooler.pool_embeddings(
    image_embeddings,
    pool_factor=3,
    padding=True,
    padding_side=processor.tokenizer.padding_side,
)
```

--------------------------------

### Token Pooling with HierarchicalTokenPooler (List Input)

Source: https://github.com/illuin-tech/colpali/blob/main/README.md

Pool embeddings from a list of 2D tensors using HierarchicalTokenPooler. Specify the pool_factor to control the compression level.
```python
import torch

from colpali_engine.compression.token_pooling import HierarchicalTokenPooler

# Dummy multivector embeddings
list_embeddings = [
    torch.rand(10, 768),
    torch.rand(20, 768),
]

# Define the pooler with the desired level of compression
pooler = HierarchicalTokenPooler()

# Pool the embeddings
outputs = pooler.pool_embeddings(list_embeddings, pool_factor=2)
```

--------------------------------

### ColbertNegativeCELoss with Hard Negatives

Source: https://context7.com/illuin-tech/colpali/llms.txt

A loss function that incorporates explicit hard negatives for improved training of retrieval models. Requires `torch` and `colpali_engine`.

```python
import torch
from colpali_engine.loss.late_interaction_losses import ColbertNegativeCELoss

# Initialize loss with hard negative support
loss_func = ColbertNegativeCELoss(
    temperature=0.02,
    normalize_scores=True,
    in_batch_term_weight=0.5,  # Weight for in-batch negatives (0-1)
)

# Simulated embeddings
batch_size = 8
query_length = 32
doc_length = 1024
num_negatives = 3
dim = 128

query_embeddings = torch.randn(batch_size, query_length, dim)
doc_embeddings = torch.randn(batch_size, doc_length, dim)
neg_doc_embeddings = torch.randn(batch_size, num_negatives, doc_length, dim)

# Normalize
query_embeddings = query_embeddings / query_embeddings.norm(dim=-1, keepdim=True)
doc_embeddings = doc_embeddings / doc_embeddings.norm(dim=-1, keepdim=True)
neg_doc_embeddings = neg_doc_embeddings / neg_doc_embeddings.norm(dim=-1, keepdim=True)

# Compute loss with explicit negatives
loss = loss_func(
    query_embeddings=query_embeddings,
    doc_embeddings=doc_embeddings,
    neg_doc_embeddings=neg_doc_embeddings,
    offset=0,
)

print(f"Loss with hard negatives: {loss.item():.4f}")
```

--------------------------------

### Large-Scale Retrieval with FastPlaid

Source: https://context7.com/illuin-tech/colpali/llms.txt

Utilize FastPlaid for efficient approximate search over multi-vector embeddings in large document collections.
This involves creating a FastPlaid index from document embeddings and then querying it.

```python
import torch
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm
from colpali_engine.models import ColQwen2, ColQwen2Processor

# pip install --no-deps fast-plaid fastkmeans

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Create a large document corpus
images = [Image.new("RGB", (800, 600), color="white") for _ in range(100)]

# Process documents in batches
dataloader = DataLoader(
    dataset=images,
    batch_size=4,
    shuffle=False,
    collate_fn=lambda x: processor.process_images(x),
)

doc_embeddings = []
for batch_doc in tqdm(dataloader, desc="Embedding documents"):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings = model(**batch_doc)
    doc_embeddings.extend(list(torch.unbind(embeddings.to("cpu"))))

# Create FastPlaid index
plaid_index = processor.create_plaid_index(doc_embeddings)

# Process queries
queries = ["Find revenue information", "Show organizational structure"]
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    query_embeddings = model(**batch_queries)

# Search using the index
top_k_results = processor.get_topk_plaid(
    query_embeddings.cpu(),
    plaid_index,
    k=10
)

print(f"Top-10 results for each query: {top_k_results}")
```

--------------------------------

### Compute ColBERT-style Similarity Scores

Source: https://context7.com/illuin-tech/colpali/llms.txt

Use `score_multi_vector` to compute late interaction scores between query and document embeddings. Ensure model and processor are loaded and inputs are processed before calling this method.
```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Sample data
images = [
    Image.new("RGB", (800, 600), color="white"),
    Image.new("RGB", (1024, 768), color="lightgray"),
]
queries = [
    "What is the revenue breakdown?",
    "Show organizational chart",
]

# Process inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Generate embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Compute MaxSim scores (late interaction)
# Returns a (n_queries, n_documents) score matrix
scores = processor.score_multi_vector(query_embeddings, image_embeddings)

print(f"Scores shape: {scores.shape}")
# Output: torch.Size([2, 2])

print(f"Scores:\n{scores}")
# Higher scores indicate better query-document matches
```
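For intuition, the late interaction (MaxSim) score computed by `score_multi_vector` above can be sketched in a few lines: for each query token, take the maximum dot product over all document tokens, then sum these maxima over the query tokens. The NumPy function below is an illustrative re-implementation under that definition, not the code behind `score_multi_vector`; the helper name `maxsim_scores` and the toy vectors are made up for this example.

```python
import numpy as np


def maxsim_scores(query_embs, doc_embs):
    """Late-interaction scores for lists of (n_tokens, dim) arrays.

    Returns an (n_queries, n_documents) matrix.
    """
    scores = np.zeros((len(query_embs), len(doc_embs)))
    for i, q in enumerate(query_embs):
        for j, d in enumerate(doc_embs):
            token_sims = q @ d.T  # (n_query_tokens, n_doc_tokens)
            # Best-matching document token per query token, summed over the query.
            scores[i, j] = token_sims.max(axis=1).sum()
    return scores


# Toy example: one query with two tokens, two documents of different lengths.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0]])              # covers only the first query token

scores = maxsim_scores([query], [doc_a, doc_b])
print(scores)
```

Because each query token contributes only its best match, a document is rewarded for covering every part of the query, which is why `doc_a` scores higher than `doc_b` in this toy case.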