Chroma (chroma-core/chroma)

https://github.com/chroma-core/chroma
Chroma is the open-source embedding database, offering the fastest way to build LLM apps with memory...

# ChromaDB

ChromaDB is the open-source AI-native vector database designed for building AI applications. It provides everything needed to store embeddings, documents, and metadata, enabling efficient similarity search and retrieval. ChromaDB handles tokenization, embedding, and indexing automatically, making it simple to build semantic search, RAG (Retrieval-Augmented Generation), and other AI-powered applications.

The core API consists of four main operations: create a collection, add data, query for similar items, and manage your data. ChromaDB supports several client types: an in-memory client for prototyping, a persistent client for local development, an HTTP client for client-server deployments, and CloudClient for Chroma Cloud. It offers native clients for Python, TypeScript/JavaScript, and Rust, and generates embeddings automatically using built-in embedding functions or external providers such as OpenAI and Cohere.

## Client Initialization

ChromaDB provides several client types for different deployment scenarios. The in-memory client suits prototyping, PersistentClient saves data to disk, HttpClient connects to a running Chroma server, and CloudClient connects to Chroma Cloud.

```python
import chromadb

# Pick ONE of the clients below; they are shown together for comparison only.
# In-memory client (ephemeral, data lost when the process exits)
client = chromadb.Client()

# Persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")

# HTTP client (connect to a running Chroma server)
client = chromadb.HttpClient(host="localhost", port=8000)

# Cloud client (connect to Chroma Cloud)
client = chromadb.CloudClient(
    tenant="your-tenant-id",
    database="your-database",
    api_key="your-api-key"
)

# Check server connection
heartbeat = client.heartbeat()  # Returns nanosecond timestamp
print(f"Server heartbeat: {heartbeat}")
```
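
The heartbeat is described above as a nanosecond timestamp. A minimal sketch of turning such a value into a readable UTC time, assuming it counts nanoseconds since the Unix epoch; the helper name `heartbeat_to_datetime` and the sample value are hypothetical and not part of the Chroma API:

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns: int) -> datetime:
    """Convert a nanosecond epoch timestamp to a timezone-aware UTC datetime."""
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    # datetime stores microseconds at most, so the last three digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


# Hypothetical sample value, not real server output
sample = 1_700_000_000_123_456_789
print(heartbeat_to_datetime(sample).isoformat())
```

In practice the argument would be the value returned by `client.heartbeat()`.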

## Create Collection

Collections are the fundamental unit of storage in ChromaDB: each one stores embeddings, documents, and metadata. Collection names must be 3-512 characters long, start and end with an alphanumeric character, and be unique within a database.

```python
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = chromadb.Client()

# Create a basic collection (uses default sentence-transformer embedding)
collection = client.create_collection(name="my_documents")

# Create collection with custom embedding function
collection = client.create_collection(
    name="openai_collection",
    embedding_function=OpenAIEmbeddingFunction(
        api_key="your-openai-api-key",
        model_name="text-embedding-3-small"
    ),
    metadata={
        "description": "Documents embedded with OpenAI",
        "created": "2024-01-15"
    }
)

# Get or create (idempotent - won't fail if exists)
collection = client.get_or_create_collection(name="my_collection")

# Get existing collection
collection = client.get_collection(name="my_collection")

# List all collections with pagination
collections = client.list_collections(limit=100, offset=0)

# Count collections
count = client.count_collections()

# Delete collection (destructive, cannot be undone)
client.delete_collection(name="old_collection")
```
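
The naming rules stated above can be checked before calling the server. The sketch below encodes only those two rules (length and alphanumeric first and last characters); the function name is hypothetical, and the server may enforce further constraints that are not modeled here:

```python
def is_plausible_collection_name(name: str) -> bool:
    """Check the two naming rules stated in this section (a client-side pre-check only)."""
    # Rule 1: length between 3 and 512 characters, inclusive
    if not 3 <= len(name) <= 512:
        return False
    # Rule 2: first and last characters are ASCII letters or digits
    return all(ch.isascii() and ch.isalnum() for ch in (name[0], name[-1]))


for candidate in ["my_documents", "ab", "_private", "docs-2024"]:
    print(candidate, is_plausible_collection_name(candidate))
```

Uniqueness within a database cannot be checked locally; `get_or_create_collection` avoids that failure mode.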

## Add Data to Collection

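Add documents, embeddings, and metadata to a collection. Each record requires a unique string ID. If embeddings are not provided, ChromaDB generates them from the documents using the collection's embedding function. All embeddings in a collection must have the same dimensionality.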

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(name="books")

# Add documents (embeddings generated automatically)
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "The quick brown fox jumps over the lazy dog",
        "Machine learning is transforming industries",
        "Vector databases enable semantic search"
    ],
    metadatas=[
        {"source": "tutorial", "chapter": 1, "tags": ["animals", "classic"]},
        {"source": "article", "chapter": 2, "year": 2024},
        {"source": "documentation", "chapter": 3, "priority": 1}
    ]
)

# Add with pre-computed embeddings. All embeddings in a collection must share
# one dimensionality, so these go into a separate (hypothetically named) collection.
vectors = client.get_or_create_collection(name="precomputed_vectors")
vectors.add(
    ids=["vec1", "vec2"],
    embeddings=[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    documents=["First document", "Second document"],
    metadatas=[{"type": "example"}, {"type": "example"}]
)

# Add embeddings only (no documents, useful for external document storage)
vectors.add(
    ids=["emb1", "emb2"],
    embeddings=[[0.1, 0.2, 0.3, 0.1], [0.4, 0.5, 0.6, 0.4]],
    metadatas=[{"external_id": "ext-001"}, {"external_id": "ext-002"}]
)
```
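
Since each record needs a unique string ID, one option is to derive the ID from the document text so that re-ingesting the same text yields the same ID. This is a sketch of one possible convention, not something Chroma requires; the `content_id` helper and the `doc-` prefix are hypothetical:

```python
import hashlib


def content_id(document: str, prefix: str = "doc") -> str:
    """Derive a stable ID from the document text using a truncated SHA-256 digest."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:16]}"


documents = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is transforming industries",
    "Vector databases enable semantic search",
]
ids = [content_id(doc) for doc in documents]
print(ids)
```

The resulting `ids` list can be passed alongside `documents` to `collection.add(...)` or `collection.upsert(...)`.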

## Query Collection

Query a collection to find the most similar documents using semantic similarity search. Supports text queries, embedding queries, metadata filtering, and document content filtering.

```python
import chromadb

client = chromadb.Client()
# Assumes the "books" collection was created and populated earlier in this process
collection = client.get_collection(name="books")

# Basic text query (embeddings generated automatically)
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5
)
print(f"IDs: {results['ids']}")
print(f"Documents: {results['documents']}")
print(f"Distances: {results['distances']}")

# Query with embeddings directly (dimensionality must match the collection's embeddings)
results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.3, 0.4]],
    n_results=10
)

# Batch query (multiple queries at once)
results = collection.query(
    query_texts=["first query", "second query"],
    n_results=5
)
# Results are grouped by query: results['ids'][0] for first query, results['ids'][1] for second

# Query with metadata filter
results = collection.query(
    query_texts=["search term"],
    n_results=10,
    where={"chapter": {"$gt": 1}},  # Only chapters > 1
    where_document={"$contains": "keyword"}  # Document must contain "keyword"
)

# Choose what fields to return
results = collection.query(
    query_texts=["my query"],
    n_results=5,
    include=["documents", "metadatas", "embeddings", "distances"]
)

# Constrain search to specific IDs
results = collection.query(
    query_texts=["query"],
    n_results=5,
    ids=["doc1", "doc2", "doc3"]  # Only search within these IDs
)
```
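
To make the similarity ranking concrete, the sketch below does an exhaustive nearest-neighbor search over a handful of vectors in plain Python. It is an illustration only: Chroma uses an index rather than a linear scan, and squared Euclidean distance is assumed here, whereas the metric a real collection uses depends on its configuration. The function names and sample vectors are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: list[float], stored: dict[str, list[float]], n_results: int = 2):
    """Return the n_results closest (id, distance) pairs, smallest distance first."""
    scored = sorted((squared_l2(query, emb), rid) for rid, emb in stored.items())
    return [(rid, dist) for dist, rid in scored[:n_results]]


# Hypothetical stored embeddings keyed by record ID
stored = {
    "vec1": [0.1, 0.2, 0.3, 0.4],
    "vec2": [0.5, 0.6, 0.7, 0.8],
    "vec3": [0.9, 0.9, 0.9, 0.9],
}
print(nearest([0.1, 0.2, 0.3, 0.4], stored))
```

A smaller distance means a closer match, which is why results come back ordered by ascending distance.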

## Get Records by ID

Retrieve records by ID or filters without similarity ranking. Useful for fetching specific documents or paginating through a collection.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="books")

# Get by specific IDs
results = collection.get(ids=["doc1", "doc2"])
print(f"IDs: {results['ids']}")
print(f"Documents: {results['documents']}")
print(f"Metadatas: {results['metadatas']}")

# Get with pagination
results = collection.get(limit=100, offset=0)

# Get with metadata filter
results = collection.get(
    where={"source": "tutorial"}
)

# Get with document content filter
results = collection.get(
    where_document={"$contains": "machine learning"}
)

# Combine filters
results = collection.get(
    where={"chapter": {"$gte": 2}},
    where_document={"$contains": "search"},
    limit=50
)

# Choose which fields to return
results = collection.get(
    ids=["doc1"],
    include=["documents", "embeddings", "metadatas"]
)

# Convenience methods
count = collection.count()  # Total records in collection
preview = collection.peek(limit=10)  # First 10 records
```
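
Paginating through a collection with `limit` and `offset` can be wrapped in a generator. The sketch below relies only on the `get(limit=..., offset=...)` call shown above; `iter_pages` and the `FakeCollection` stand-in are hypothetical and exist so the example runs without a server:

```python
def iter_pages(collection, page_size: int = 100):
    """Yield successive get() results until an empty page is returned."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])


class FakeCollection:
    """Stand-in that mimics only the get(limit, offset) shape, for demonstration."""

    def __init__(self, ids):
        self._ids = ids

    def get(self, limit, offset):
        return {"ids": self._ids[offset:offset + limit]}


fake = FakeCollection([f"doc{i}" for i in range(5)])
for page in iter_pages(fake, page_size=2):
    print(page["ids"])
```

With a real collection, the same generator would be called as `iter_pages(collection, page_size=100)`.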

## Update and Upsert Records

Update existing records with `update`, or use `upsert` to update records that already exist and insert those that do not.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="books")

# Update existing records (IDs that are not found are skipped and an error is logged)
collection.update(
    ids=["doc1", "doc2"],
    documents=["Updated document 1", "Updated document 2"],
    metadatas=[{"updated": True, "version": 2}, {"updated": True, "version": 2}]
)

# Update embeddings directly (dimensionality must match the collection's embeddings)
collection.update(
    ids=["doc1"],
    embeddings=[[0.9, 0.8, 0.7, 0.6]]
)

# Upsert: update if exists, insert if not
collection.upsert(
    ids=["doc1", "new_doc"],
    documents=["Updated or new document 1", "Brand new document"],
    metadatas=[{"status": "upserted"}, {"status": "upserted"}]
)

# Upsert with embeddings (dimensionality must match the collection's existing embeddings)
collection.upsert(
    ids=["vec1", "vec2"],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    documents=["Doc 1", "Doc 2"],
    metadatas=[{"type": "vector"}, {"type": "vector"}]
)
```
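
The difference between the two operations can be modeled with a plain dictionary. This is a conceptual sketch only: the `store` dict stands in for a collection, and the function names are hypothetical rather than Chroma's implementation.

```python
def update(store: dict, record_id: str, document: str) -> bool:
    """Overwrite an existing record; leave the store untouched if the ID is unknown."""
    if record_id not in store:
        return False  # unknown ID: nothing is inserted
    store[record_id] = document
    return True


def upsert(store: dict, record_id: str, document: str) -> None:
    """Overwrite the record if it exists, otherwise insert it."""
    store[record_id] = document


store = {"doc1": "Original document 1"}
update(store, "doc1", "Updated document 1")   # existing ID: overwritten
update(store, "new_doc", "Ignored")           # unknown ID: skipped
upsert(store, "new_doc", "Brand new document")  # unknown ID: inserted
print(store)
```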

## Delete Records

Delete records from a collection by ID or using filters. This operation is destructive and cannot be undone.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="books")

# Delete by specific IDs
collection.delete(ids=["doc1", "doc2", "doc3"])

# Delete by metadata filter (deletes all matching records)
collection.delete(
    where={"chapter": 20}  # integer, matching how "chapter" was stored when added
)

# Delete by document content filter
collection.delete(
    where_document={"$contains": "deprecated"}
)

# Combine ID and filter (deletes records matching both)
collection.delete(
    ids=["doc1", "doc2"],
    where={"status": "archived"}
)
```

## Metadata Filtering

Filter `query` and `get` results with metadata conditions built from comparison, logical, and inclusion operators.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="products")

# Comparison operators: $eq, $ne, $gt, $gte, $lt, $lte
results = collection.query(
    query_texts=["laptop"],
    where={"price": {"$lt": 1000}}  # price less than 1000
)

results = collection.query(
    query_texts=["laptop"],
    where={"rating": {"$gte": 4.5}}  # rating >= 4.5
)

# Equality (shorthand)
results = collection.get(where={"category": "electronics"})
# Equivalent to:
results = collection.get(where={"category": {"$eq": "electronics"}})

# Logical operators: $and, $or
results = collection.query(
    query_texts=["phone"],
    where={
        "$and": [
            {"price": {"$gte": 500}},
            {"price": {"$lte": 1000}},
            {"brand": "Apple"}
        ]
    }
)

results = collection.get(
    where={
        "$or": [
            {"color": "red"},
            {"color": "blue"}
        ]
    }
)

# Inclusion operators: $in, $nin
results = collection.get(
    where={"author": {"$in": ["Rowling", "Tolkien", "Martin"]}}
)

results = collection.get(
    where={"status": {"$nin": ["deleted", "archived"]}}
)

# Array metadata with $contains and $not_contains
collection.add(
    ids=["movie1", "movie2"],
    documents=["Action movie", "Drama movie"],
    metadatas=[
        {"genres": ["action", "thriller"], "year": 2023},
        {"genres": ["drama", "romance"], "year": 2024}
    ]
)

results = collection.get(
    where={"genres": {"$contains": "action"}}
)

results = collection.get(
    where={"genres": {"$not_contains": "horror"}}
)

# Combine metadata and document filters
results = collection.query(
    query_texts=["exciting story"],
    where={"year": {"$gte": 2020}},
    where_document={"$contains": "adventure"}
)
```
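
The operator semantics can be made explicit with a small local evaluator. This is a simplified sketch for reasoning about filters, not how Chroma evaluates them: it covers the comparison, logical, and `$in`/`$nin` operators for scalar values only, and it treats a missing key as a non-match, which is an assumption. The `matches` function is hypothetical.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if a metadata dict satisfies a where clause (scalar values only)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # assumption: a missing key never matches
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # shorthand equality
            value = metadata[key]
            for op, operand in condition.items():
                if op == "$in":
                    ok = value in operand
                elif op == "$nin":
                    ok = value not in operand
                else:
                    ok = _COMPARATORS[op](value, operand)
                if not ok:
                    return False
    return True


item = {"price": 799, "brand": "Apple", "color": "red"}
print(matches(item, {"$and": [{"price": {"$gte": 500}}, {"brand": "Apple"}]}))
```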

## Document Content Filtering

Filter documents by their text content using contains, regex, and logical operators.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="articles")

# Contains filter
results = collection.get(
    where_document={"$contains": "machine learning"}
)

# Not contains filter
results = collection.get(
    where_document={"$not_contains": "deprecated"}
)

# Regex filter
results = collection.get(
    where_document={"$regex": "chapter \\d+"}  # Matches "chapter 1", "chapter 2", etc.
)

# Not regex filter
results = collection.get(
    where_document={"$not_regex": "^DRAFT:"}  # Exclude documents starting with "DRAFT:"
)

# Combine with logical operators
results = collection.get(
    where_document={
        "$and": [
            {"$contains": "python"},
            {"$not_contains": "deprecated"}
        ]
    }
)

# Combine with query and metadata filter
results = collection.query(
    query_texts=["programming tutorials"],
    n_results=10,
    where={"category": "tech"},
    where_document={
        "$or": [
            {"$contains": "beginner"},
            {"$contains": "tutorial"}
        ]
    }
)
```
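
The document operators can likewise be sketched as a local predicate. This is an approximation for illustration: it uses Python's `re` module, whose syntax may differ from the regex engine Chroma uses, and it assumes case-sensitive substring matching. The `document_matches` function is hypothetical.

```python
import re


def document_matches(document: str, where_document: dict) -> bool:
    """Return True if the text satisfies a where_document clause."""
    for op, operand in where_document.items():
        if op == "$and":
            ok = all(document_matches(document, sub) for sub in operand)
        elif op == "$or":
            ok = any(document_matches(document, sub) for sub in operand)
        elif op == "$contains":
            ok = operand in document
        elif op == "$not_contains":
            ok = operand not in document
        elif op == "$regex":
            ok = re.search(operand, document) is not None
        elif op == "$not_regex":
            ok = re.search(operand, document) is None
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


text = "chapter 12: python basics"
print(document_matches(text, {"$regex": "chapter \\d+"}))
```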

## Modify Collection

Update a collection's name or metadata after creation.

```python
import chromadb

client = chromadb.Client()
collection = client.get_collection(name="my_collection")

# Modify collection name
collection.modify(name="renamed_collection")

# Modify collection metadata
collection.modify(
    metadata={
        "description": "Updated description",
        "last_modified": "2024-01-15"
    }
)

# Modify both name and metadata
collection.modify(
    name="new_name",
    metadata={"version": "2.0"}
)
```

## TypeScript Client

ChromaDB provides a TypeScript/JavaScript client that connects to a running Chroma server.

```typescript
import { ChromaClient, CloudClient } from "chromadb";

// Connect to local server
const client = new ChromaClient({
  host: "localhost",
  port: 8000,
});

// Or connect to Chroma Cloud
const cloudClient = new CloudClient({
  tenant: "your-tenant",
  database: "your-database",
  apiKey: "your-api-key",
});

// Create collection
const collection = await client.createCollection({
  name: "my_collection",
  metadata: { description: "My documents" },
});

// Add documents
await collection.add({
  ids: ["id1", "id2"],
  documents: ["Document about cats", "Document about dogs"],
  metadatas: [{ animal: "cat" }, { animal: "dog" }],
});

// Query
const results = await collection.query({
  queryTexts: ["pets"],
  nResults: 5,
  where: { animal: "cat" },
});

// Iterate over results
for (const batch of results.rows()) {
  for (const row of batch) {
    console.log(row.id, row.document, row.metadata, row.distance);
  }
}

// Get by ID
const docs = await collection.get({
  ids: ["id1"],
  include: ["documents", "metadatas"],
});

// Update
await collection.update({
  ids: ["id1"],
  documents: ["Updated document about cats"],
  metadatas: [{ animal: "cat", updated: true }],
});

// Upsert
await collection.upsert({
  ids: ["id1", "id3"],
  documents: ["Upserted doc 1", "New doc 3"],
  metadatas: [{ status: "upserted" }, { status: "new" }],
});

// Delete
await collection.delete({ ids: ["id1"] });

// List collections
const collections = await client.listCollections({ limit: 100 });

// Delete collection
await client.deleteCollection({ name: "my_collection" });
```

## Rust Client

The Rust client connects to a running Chroma server and requires embeddings to be provided directly.

```rust
use chroma::{ChromaHttpClient, ChromaHttpClientOptions};
use chroma_types::{IncludeList, Include, Where, MetadataExpression, MetadataComparison, PrimitiveOperator, MetadataValue};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to local server
    let client = ChromaHttpClient::new(Default::default());

    // Or connect to Chroma Cloud
    let options = ChromaHttpClientOptions::cloud("api-key", "database-name")?;
    let cloud_client = ChromaHttpClient::new(options);

    // Create collection
    let collection = client
        .create_collection("my_collection", None, None)
        .await?;

    // Add documents with embeddings (must provide embeddings directly)
    collection.add(
        vec!["id1".to_string(), "id2".to_string()],
        vec![vec![0.1, 0.2, 0.3], vec![0.4, 0.5, 0.6]],
        Some(vec![
            Some("Document about cats".to_string()),
            Some("Document about dogs".to_string()),
        ]),
        None, // uris
        None, // metadatas
    ).await?;

    // Query with embeddings
    let results = collection
        .query(
            vec![vec![0.1, 0.2, 0.3]], // query embeddings
            Some(5),                    // n_results
            None,                       // where
            None,                       // ids
            None,                       // include
        )
        .await?;

    // Query with filter
    let where_clause = Where::Metadata(MetadataExpression {
        key: "animal".to_string(),
        comparison: MetadataComparison::Primitive(
            PrimitiveOperator::Equal,
            MetadataValue::Str("cat".to_string()),
        ),
    });

    let filtered_results = collection
        .query(
            vec![vec![0.1, 0.2, 0.3]],
            Some(10),
            Some(where_clause),
            None,
            Some(IncludeList(vec![Include::Document, Include::Metadata])),
        )
        .await?;

    // Get by IDs
    let docs = collection
        .get(
            Some(vec!["id1".to_string()]),
            None,
            Some(10),
            Some(0),
            Some(IncludeList::default_get()),
        )
        .await?;

    // Delete
    collection.delete(
        Some(vec!["id1".to_string()]),
        None,
    ).await?;

    Ok(())
}
```

## Running Chroma Server

Run a Chroma server for client-server deployments.

```bash
# Install chromadb
pip install chromadb

# Run server with persistent storage
chroma run --path /path/to/db --host 0.0.0.0 --port 8000

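# Run with Docker (the data directory inside the container varies by image version; check the image docs before mounting)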
docker pull chromadb/chroma
docker run -p 8000:8000 -v /path/to/db:/chroma/chroma chromadb/chroma

# Using npm/npx (for TypeScript projects)
npx chroma run --path ./chroma_data
```

## Summary

ChromaDB is suited to AI applications that need semantic search and retrieval. Its primary use cases are Retrieval-Augmented Generation (RAG) systems, where relevant context is retrieved from a knowledge base to augment LLM prompts; semantic search applications that find similar documents by meaning rather than keywords; recommendation systems that suggest similar items based on embeddings; and document Q&A systems that answer questions from a corpus of documents.
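
In a RAG system, the retrieved documents are typically placed into the prompt ahead of the question. A minimal sketch of that step, assuming the documents have already been retrieved; the `build_prompt` helper, the prompt wording, and the character budget are all hypothetical choices:

```python
def build_prompt(question: str, documents: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt from retrieved documents, stopping at a character budget."""
    context_parts = []
    used = 0
    for doc in documents:
        if used + len(doc) > max_chars:
            break  # keep the context within the budget
        context_parts.append(f"- {doc}")
        used += len(doc)
    context = "\n".join(context_parts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved documents, e.g. results["documents"][0] from a query
retrieved = [
    "Vector databases enable semantic search",
    "Machine learning is transforming industries",
]
print(build_prompt("What do vector databases enable?", retrieved))
```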

Integration with ChromaDB follows a consistent pattern: initialize a client suited to your deployment (in-memory for prototyping, persistent for local development, HTTP or Cloud for client-server and distributed systems), create collections with optional embedding functions, add documents and metadata, and query using text or embeddings with optional filters. Query results are column-major: each field (`ids`, `documents`, `distances`, and so on) is a list with one inner list per query, which suits batch processing. For production deployments, Chroma Cloud provides a fully managed, serverless solution with automatic scaling and $5 in free credits to get started.
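
Column-major results are often easier to consume once reshaped into one dictionary per hit. The sketch below assumes the Python result shape used earlier (`results['ids'][0]` for the first query) and that fields not requested are `None`; the `rows_for_query` helper and the sample data are hypothetical:

```python
_FIELDS = {"documents": "document", "metadatas": "metadata", "distances": "distance"}


def rows_for_query(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the columns for one query into a list of per-record dictionaries."""
    rows = []
    for i, record_id in enumerate(results["ids"][query_index]):
        row = {"id": record_id}
        for column_name, row_key in _FIELDS.items():
            column = results.get(column_name)
            if column is not None:  # skip fields that were not included
                row[row_key] = column[query_index][i]
        rows.append(row)
    return rows


# Hypothetical results for a batch of two queries
results = {
    "ids": [["doc2", "doc3"], ["doc1"]],
    "documents": [["Machine learning is transforming industries",
                   "Vector databases enable semantic search"],
                  ["The quick brown fox jumps over the lazy dog"]],
    "metadatas": None,
    "distances": [[0.21, 0.48], [0.35]],
}
for row in rows_for_query(results, query_index=0):
    print(row)
```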