# pyseekdb

pyseekdb is a unified Python client library for SeekDB and OceanBase vector databases that provides simple, beginner-friendly APIs for AI applications. It abstracts away SQL complexities by treating vector data operations as key-value operations similar to MongoDB, Elasticsearch, and Milvus. The library follows a schema-free interface design where users manage text documents and their vector embeddings without explicitly defining relational table structures. This SDK is particularly valuable for RAG (Retrieval-Augmented Generation) applications, semantic search systems, and AI-powered knowledge bases.

The library supports three connection modes: embedded mode for local development using pylibseekdb, remote SeekDB server mode for dedicated vector database deployments, and OceanBase server mode for enterprise multi-tenant environments. It features automatic embedding generation through configurable embedding functions, efficient HNSW (Hierarchical Navigable Small World) vector indexing, full-text search combined with semantic search via hybrid search, and comprehensive filtering with metadata and document queries. The design emphasizes ease of use with automatic dimension detection, optional embedding function configuration, and unified CRUD operations across all deployment modes.

## Client Connection - Embedded Mode

Initialize a local embedded SeekDB instance for development and testing.
```python
import pyseekdb

# Create embedded client with explicit path
client = pyseekdb.Client(
    path="./seekdb",
    database="demo"
)

# Execute raw SQL if needed
rows = client.execute("SELECT COUNT(*) FROM information_schema.tables")
print(f"Tables: {rows}")

# Create collection with auto-generated embeddings
collection = client.create_collection("documents")
collection.add(
    ids=["doc1", "doc2"],
    documents=["Python is popular", "Machine learning transforms AI"],
    metadatas=[{"lang": "python"}, {"topic": "ai"}]
)

# Query by semantic similarity
results = collection.query(
    query_texts=["programming languages"],
    n_results=1
)
print(f"Found: {results['documents'][0][0]}")
# Output: Found: Python is popular
```

## Client Connection - Remote SeekDB Server

Connect to a remote SeekDB server for production vector search workloads.

```python
import pyseekdb
from pyseekdb import HNSWConfiguration, DefaultEmbeddingFunction

# Connect to SeekDB server
client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    database="production",
    user="root",
    password=""  # Uses SEEKDB_PASSWORD env var if empty
)

# Create collection with custom configuration
config = HNSWConfiguration(dimension=384, distance='cosine')
ef = DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')
collection = client.create_collection(
    name="knowledge_base",
    configuration=config,
    embedding_function=ef
)

# Batch insert with automatic embedding
docs = [
    "Neural networks mimic brain structure",
    "Vector databases enable semantic search",
    "Python supports machine learning workflows"
]
ids = [f"kb_{i}" for i in range(len(docs))]
# Metadata gives the "topic" filter below something to match
metadatas = [{"topic": "ai"}, {"topic": "db"}, {"topic": "ml"}]
collection.add(ids=ids, documents=docs, metadatas=metadatas)

# Semantic search with metadata filter
results = collection.query(
    query_texts=["AI and deep learning"],
    where={"$or": [{"topic": "ai"}, {"topic": "ml"}]},
    n_results=2
)
for i, doc_id in enumerate(results['ids'][0]):
    print(f"{i+1}. {results['documents'][0][i]} (distance: {results['distances'][0][i]:.4f})")
```

## Client Connection - OceanBase Multi-Tenant

Connect to OceanBase server for enterprise deployments with tenant isolation.

```python
import os
import random

import pyseekdb
from pyseekdb import HNSWConfiguration

# Set password via environment variable
# (shown inline for brevity; in practice, export it in the shell instead)
os.environ["SEEKDB_PASSWORD"] = "secure_password"

# Connect to OceanBase with tenant
client = pyseekdb.Client(
    host="oceanbase.example.com",
    port=2881,
    tenant="analytics",  # Tenant isolation
    database="vectors",
    user="analyst",
    password=""  # Automatically reads from SEEKDB_PASSWORD
)

# Verify connection
collections = client.list_collections()
print(f"Available collections: {len(collections)}")

# Create collection with manual embeddings (no auto-embedding)
collection = client.create_collection(
    name="custom_vectors",
    configuration=HNSWConfiguration(dimension=128, distance='l2'),
    embedding_function=None  # Manual embeddings required
)

# Insert with pre-computed embeddings
embeddings = [[random.random() for _ in range(128)] for _ in range(3)]
collection.add(
    ids=["v1", "v2", "v3"],
    embeddings=embeddings,
    documents=["Doc one", "Doc two", "Doc three"],
    metadatas=[{"idx": 1}, {"idx": 2}, {"idx": 3}]
)

# Vector similarity search with manual query embedding
query_embedding = [random.random() for _ in range(128)]
results = collection.query(
    query_embeddings=query_embedding,
    where={"idx": {"$gte": 2}},
    n_results=2
)
print(f"Matched {len(results['ids'][0])} vectors")
```

## Database Management with AdminClient

Manage databases across embedded, SeekDB, and OceanBase deployments.
```python
import pyseekdb

# Embedded mode - database admin
admin = pyseekdb.AdminClient(path="./seekdb")

# Create and list databases
admin.create_database("analytics")
admin.create_database("staging")
databases = admin.list_databases()
for db in databases:
    print(f"Database: {db.name}, Charset: {db.charset}, Collation: {db.collation}")

# Get database metadata
db_info = admin.get_database("analytics")
print(f"Retrieved: {db_info.name}")

# OceanBase mode - tenant-aware database management
admin_ob = pyseekdb.AdminClient(
    host="127.0.0.1",
    port=2881,
    tenant="analytics",
    user="admin",
    password="secure_pass"
)

# List databases in tenant
dbs = admin_ob.list_databases(tenant="analytics", limit=10, offset=0)
print(f"Tenant 'analytics' has {len(dbs)} databases")

# Delete database
admin.delete_database("staging")

# Connect to specific database for collection operations
client = pyseekdb.Client(path="./seekdb", database="analytics")
collection = client.create_collection("reports")
print(f"Created collection in {client._client.database}")
```

## Collection Creation and Management

Create and configure collections with vector indexes and embedding functions.
```python
import random
from typing import List, Union

import pyseekdb
from pyseekdb import HNSWConfiguration, DefaultEmbeddingFunction

client = pyseekdb.Client(database="vectors")

# Create with default configuration (384-dim, cosine distance)
collection_basic = client.create_collection("basic_docs")
print(f"Default dimension: {collection_basic.dimension}")
# Output: Default dimension: 384

# Create with custom HNSW configuration
config = HNSWConfiguration(dimension=768, distance='inner_product')
collection_custom = client.create_collection(
    name="advanced_docs",
    configuration=config
)

# Create with custom embedding function
class CustomEmbedding:
    @property
    def dimension(self) -> int:
        return 512

    def __call__(self, input: Union[str, List[str]]) -> List[List[float]]:
        # Placeholder: random vectors stand in for a real embedding model
        texts = [input] if isinstance(input, str) else input
        return [[random.random() for _ in range(512)] for _ in texts]

ef = CustomEmbedding()
collection_ef = client.create_collection(
    name="custom_embed",
    configuration=HNSWConfiguration(dimension=512, distance='cosine'),
    embedding_function=ef
)

# Get or create (idempotent)
collection = client.get_or_create_collection("idempotent_docs")

# Check existence
if client.has_collection("advanced_docs"):
    col = client.get_collection("advanced_docs", embedding_function=None)
    print(f"Collection {col.name} has {col.dimension} dimensions")

# List all collections
all_collections = client.list_collections()
for c in all_collections:
    print(f"- {c.name}: {c.dimension}D, {c.distance} metric")

# Count collections
total = client.count_collection()
print(f"Total collections: {total}")

# Delete collection
client.delete_collection("basic_docs")
```

## Data Insertion with Add Operation

Insert new documents with automatic or manual embedding generation.
```python
import random

import pyseekdb

client = pyseekdb.Client(database="content")
collection = client.create_collection("articles")

# Add single item with auto-generated embedding
collection.add(
    ids="art_001",
    documents="Python enables rapid AI development",
    metadatas={"category": "tech", "year": 2024, "rating": 4.5}
)

# Add multiple items with auto-embedding
articles = [
    "Machine learning requires quality training data",
    "Vector databases optimize similarity search",
    "Neural networks process complex patterns",
    "Natural language processing understands text"
]
ids = [f"art_{100+i}" for i in range(len(articles))]
metadatas = [
    {"category": "AI", "year": 2023, "rating": 4.8, "tags": ["ml", "data"]},
    {"category": "DB", "year": 2024, "rating": 4.6, "tags": ["vectors", "search"]},
    {"category": "AI", "year": 2023, "rating": 4.7, "tags": ["neural", "dl"]},
    {"category": "NLP", "year": 2024, "rating": 4.9, "tags": ["text", "ai"]}
]
collection.add(ids=ids, documents=articles, metadatas=metadatas)

# Add with pre-computed embeddings (bypasses embedding function)
manual_embeddings = [[random.random() for _ in range(384)] for _ in range(2)]
collection.add(
    ids=["art_200", "art_201"],
    embeddings=manual_embeddings,
    documents=["Custom embed doc 1", "Custom embed doc 2"],
    metadatas=[{"source": "manual"}, {"source": "manual"}]
)

# Add embeddings only (no documents)
vector_only = [[random.random() for _ in range(384)] for _ in range(3)]
collection.add(
    ids=["vec_1", "vec_2", "vec_3"],
    embeddings=vector_only
)

# Verify insertion
count = collection.count()
print(f"Total articles: {count}")
# Output: Total articles: 10
```

## Data Update and Upsert Operations

Modify existing records or insert new ones with flexible update semantics.
```python
import random

import pyseekdb

client = pyseekdb.Client(database="content")
collection = client.get_collection("articles")

# Update metadata only
collection.update(
    ids="art_001",
    metadatas={"category": "tech", "year": 2024, "rating": 5.0, "featured": True}
)

# Update document and embedding (auto-generated)
collection.update(
    ids="art_100",
    documents="Deep learning transforms machine learning applications",
    metadatas={"category": "AI", "updated": True}
)

# Update multiple items with new embeddings
collection.update(
    ids=["art_101", "art_102"],
    documents=["Updated vector database content", "Updated neural network content"],
    metadatas=[{"updated": True, "version": 2}, {"updated": True, "version": 2}]
)

# Upsert existing item (updates if exists)
collection.upsert(
    ids="art_100",
    documents="Machine learning revolutionizes data analysis",
    metadatas={"category": "AI", "year": 2024, "upserted": True}
)

# Upsert new item (inserts if not exists)
collection.upsert(
    ids="art_300",
    documents="Transformers enable state-of-the-art NLP",
    metadatas={"category": "NLP", "year": 2024, "new": True}
)

# Batch upsert (mix of existing and new)
upsert_ids = ["art_101", "art_400", "art_401"]  # art_101 exists, others new
upsert_docs = [
    "Updated: Vectors power semantic search",
    "New: Attention mechanisms improve models",
    "New: BERT revolutionized NLP tasks"
]
embeddings = [[random.random() for _ in range(384)] for _ in range(3)]
collection.upsert(
    ids=upsert_ids,
    embeddings=embeddings,
    documents=upsert_docs,
    metadatas=[{"op": "upsert"} for _ in range(3)]
)

# Verify updates
result = collection.get(ids="art_300")
print(f"Upserted doc: {result['documents'][0]}")
```

## Data Deletion with Filters

Remove documents by ID, metadata filters, or document content filters.
```python
import pyseekdb

client = pyseekdb.Client(database="content")
collection = client.get_collection("articles")

# Delete by single ID
collection.delete(ids="art_300")

# Delete by multiple IDs
collection.delete(ids=["vec_1", "vec_2", "vec_3"])

# Delete by metadata filter (equality)
collection.delete(where={"source": {"$eq": "manual"}})

# Delete by comparison operator
collection.delete(where={"rating": {"$lt": 4.5}})

# Delete by $in operator
collection.delete(where={"category": {"$in": ["deprecated", "archived"]}})

# Delete by logical OR
collection.delete(
    where={
        "$or": [
            {"year": {"$lt": 2020}},
            {"rating": {"$lt": 3.0}}
        ]
    }
)

# Delete by document content filter
collection.delete(where_document={"$contains": "obsolete"})

# Delete with combined filters
collection.delete(
    where={"category": {"$eq": "tech"}, "year": {"$lt": 2023}},
    where_document={"$contains": "deprecated"}
)

# Delete all low-rated AI articles from 2023
collection.delete(
    where={
        "$and": [
            {"category": "AI"},
            {"year": {"$eq": 2023}},
            {"rating": {"$lte": 4.0}}
        ]
    }
)

# Verify remaining count
remaining = collection.count()
print(f"Remaining articles: {remaining}")
```

## Vector Similarity Search with Query

Perform semantic search using vector embeddings with metadata and document filters.

```python
import random

import pyseekdb

client = pyseekdb.Client(database="knowledge")
collection = client.get_collection("documents")

# Basic semantic search with query text
results = collection.query(
    query_texts="artificial intelligence and deep learning",
    n_results=5
)
for i in range(len(results['ids'][0])):
    doc_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    document = results['documents'][0][i]
    print(f"{i+1}. [{doc_id}] {document} (distance: {distance:.4f})")

# Query with manual embedding vector
query_vector = [random.random() for _ in range(384)]
results = collection.query(
    query_embeddings=query_vector,
    n_results=3
)

# Batch query with multiple texts
results = collection.query(
    query_texts=["machine learning", "natural language processing", "computer vision"],
    n_results=2
)
# results['ids'][0] = top 2 for "machine learning"
# results['ids'][1] = top 2 for "natural language processing"
# results['ids'][2] = top 2 for "computer vision"
for query_idx, query_ids in enumerate(results['ids']):
    print(f"Query {query_idx+1}: {len(query_ids)} results")

# Query with metadata filter (simplified equality)
results = collection.query(
    query_texts="python programming",
    where={"category": "tech"},
    n_results=5
)

# Query with comparison operators
results = collection.query(
    query_texts="advanced AI techniques",
    where={"year": {"$gte": 2023}, "rating": {"$gte": 4.5}},
    n_results=3
)

# Query with $in operator
results = collection.query(
    query_texts="data science tools",
    where={"tags": {"$in": ["ml", "data", "analytics"]}},
    n_results=5
)

# Query with logical OR
results = collection.query(
    query_texts="neural networks",
    where={
        "$or": [
            {"category": "AI"},
            {"category": "ML"}
        ]
    },
    n_results=5
)

# Query with document content filter
results = collection.query(
    query_texts="machine learning",
    where_document={"$contains": "neural network"},
    n_results=3
)

# Query with combined filters
results = collection.query(
    query_texts="AI research",
    where={"category": "AI", "year": {"$gte": 2024}},
    where_document={"$contains": "transformer"},
    include=["documents", "metadatas", "embeddings"],
    n_results=5
)
for i in range(len(results['ids'][0])):
    print(f"ID: {results['ids'][0][i]}")
    print(f"Document: {results['documents'][0][i]}")
    print(f"Metadata: {results['metadatas'][0][i]}")
    print(f"Embedding dim: {len(results['embeddings'][0][i])}")
    print(f"Distance: {results['distances'][0][i]:.4f}\n")
```
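The query examples above index results as `results['ids'][query][hit]`, one inner list per query text. The sketch below shows one way to unpack that nested layout into flat rows. It assumes the result behaves like a plain dict of lists of lists, as the examples suggest; `flatten_query_results` is a hypothetical helper, not part of pyseekdb, and the `sample` dict is hand-written stand-in data rather than real query output.

```python
from typing import Any, Dict, List

# Maps the plural result keys used in the examples to per-row field names
FIELD_NAMES = {"documents": "document", "metadatas": "metadata", "distances": "distance"}


def flatten_query_results(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a nested query result into one flat row per hit (hypothetical helper)."""
    rows = []
    for query_idx, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            row = {"query": query_idx, "rank": rank + 1, "id": doc_id}
            for key, name in FIELD_NAMES.items():
                values = results.get(key)
                if values is not None:  # field may be absent if not included
                    row[name] = values[query_idx][rank]
            rows.append(row)
    return rows


# Hand-written stand-in for a batch query response: two queries, two hits each
sample = {
    "ids": [["a", "b"], ["c", "d"]],
    "documents": [["doc a", "doc b"], ["doc c", "doc d"]],
    "distances": [[0.1, 0.2], [0.3, 0.4]],
}
rows = flatten_query_results(sample)
for row in rows:
    print(row)
```

Flat rows like these are convenient to sort, filter, or load into a table when several query texts are sent in one batch.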
## Data Retrieval with Get Operation

Retrieve documents by ID or filters without vector similarity ranking.

```python
import pyseekdb

client = pyseekdb.Client(database="knowledge")
collection = client.get_collection("documents")

# Get single document by ID
result = collection.get(ids="art_001")
print(f"Document: {result['documents'][0]}")
print(f"Metadata: {result['metadatas'][0]}")

# Get multiple documents by IDs
result = collection.get(ids=["art_001", "art_100", "art_101"])
for i, doc_id in enumerate(result['ids']):
    print(f"{i+1}. [{doc_id}] {result['documents'][i]}")

# Get by metadata filter (simplified equality)
result = collection.get(
    where={"category": "AI"},
    limit=10
)
print(f"Found {len(result['ids'])} AI documents")

# Get by comparison operators
result = collection.get(
    where={"rating": {"$gte": 4.5}, "year": {"$eq": 2024}},
    limit=5
)

# Get by $in operator
result = collection.get(
    where={"category": {"$in": ["AI", "ML", "NLP"]}},
    limit=20
)

# Get by logical OR
result = collection.get(
    where={
        "$or": [
            {"category": "AI"},
            {"rating": {"$gte": 4.8}}
        ]
    },
    limit=15
)

# Get by document content filter
result = collection.get(
    where_document={"$contains": "machine learning"},
    limit=10
)

# Get with pagination
page_1 = collection.get(limit=10, offset=0)
page_2 = collection.get(limit=10, offset=10)
page_3 = collection.get(limit=10, offset=20)
print(f"Page 1: {len(page_1['ids'])} items")
print(f"Page 2: {len(page_2['ids'])} items")

# Get with combined filters
result = collection.get(
    where={"category": "AI", "year": {"$gte": 2023}},
    where_document={"$contains": "neural"},
    include=["documents", "metadatas", "embeddings"],
    limit=5
)

# Get all documents (up to limit)
all_docs = collection.get(limit=1000)
print(f"Total documents retrieved: {len(all_docs['ids'])}")

# Get specific fields only
result = collection.get(
    ids=["art_100", "art_101"],
    include=["documents", "metadatas"]  # Excludes embeddings
)
print(f"Has embeddings: {'embeddings' in result}")
# Output: Has embeddings: False
```

## Hybrid Search - Full-Text + Vector Fusion

Combine full-text search and vector similarity with intelligent result ranking.

```python
import random

import pyseekdb

client = pyseekdb.Client(database="knowledge")
collection = client.get_collection("documents")

# Basic hybrid search: full-text keyword + semantic similarity
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "n_results": 10
    },
    knn={
        "query_texts": ["artificial intelligence research"],
        "n_results": 10
    },
    rank={"rrf": {}},  # Reciprocal Rank Fusion
    n_results=5
)
print("Top 5 hybrid results:")
for i, doc_id in enumerate(results['ids'][0]):
    print(f"{i+1}. [{doc_id}] {results['documents'][0][i]}")

# Hybrid search with independent filters
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "neural network"},
        "where": {"year": {"$eq": 2024}},  # Filter for full-text search
        "n_results": 10
    },
    knn={
        "query_texts": ["deep learning applications"],
        "where": {"rating": {"$gte": 4.5}},  # Different filter for vector search
        "n_results": 10
    },
    rank={"rrf": {"rank_window_size": 60, "rank_constant": 60}},
    n_results=5,
    include=["documents", "metadatas", "distances"]
)

# Hybrid search with batch queries
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "AI"},
        "n_results": 10
    },
    knn={
        "query_texts": ["transformers", "computer vision", "reinforcement learning"],
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=3
)
# Returns combined results from all queries

# Full-text only hybrid search (no vector component)
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "Python programming"},
        "where": {"category": "tech"},
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)

# Vector only hybrid search (no full-text component)
query_embedding = [random.random() for _ in range(384)]
results = collection.hybrid_search(
    knn={
        "query_embeddings": [query_embedding],
        "where": {"year": {"$gte": 2023}},
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)

# Complex multi-criteria hybrid search
results = collection.hybrid_search(
    query={
        "where_document": {
            "$or": [
                {"$contains": "machine learning"},
                {"$contains": "deep learning"}
            ]
        },
        "where": {"category": "AI"},
        "n_results": 15
    },
    knn={
        "query_texts": ["neural network architectures"],
        "where": {
            "$and": [
                {"year": {"$gte": 2023}},
                {"rating": {"$gte": 4.0}}
            ]
        },
        "n_results": 15
    },
    rank={"rrf": {"rank_window_size": 100, "rank_constant": 60}},
    n_results=10,
    include=["documents", "metadatas", "embeddings", "distances"]
)
for i in range(len(results['ids'][0])):
    print(f"\nRank {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Doc: {results['documents'][0][i]}")
    print(f"  Meta: {results['metadatas'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
```

## Custom Embedding Functions

Implement custom embedding functions for domain-specific vector generation.

```python
import os
from typing import List, Optional, Union

import pyseekdb
from openai import OpenAI  # openai>=1.0 client interface

# Example 1: Sentence-Transformers Custom Embedding
class SentenceTransformerEmbedding:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", device: str = "cpu"):
        self.model_name = model_name
        self.device = device
        self._model = None
        self._dimension = None

    def _ensure_model_loaded(self):
        # Load the model lazily and probe it once to learn the output dimension
        if self._model is None:
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer(self.model_name, device=self.device)
            test_embedding = self._model.encode(["test"], convert_to_numpy=True)
            self._dimension = len(test_embedding[0])

    @property
    def dimension(self) -> int:
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Union[str, List[str]]) -> List[List[float]]:
        self._ensure_model_loaded()
        if isinstance(input, str):
            input = [input]
        if not input:
            return []
        embeddings = self._model.encode(input, convert_to_numpy=True, show_progress_bar=False)
        return [embedding.tolist() for embedding in embeddings]

# Use custom embedding function
ef = SentenceTransformerEmbedding(model_name='all-mpnet-base-v2', device='cpu')
client = pyseekdb.Client(database="custom")
collection = client.create_collection(
    name="research_papers",
    configuration=pyseekdb.HNSWConfiguration(dimension=ef.dimension, distance='cosine'),
    embedding_function=ef
)

# Add documents (automatically embedded with custom function)
collection.add(
    ids=["paper_1", "paper_2"],
    documents=[
        "Attention mechanisms improve neural machine translation",
        "Convolutional neural networks excel at image classification"
    ],
    metadatas=[{"field": "NLP"}, {"field": "CV"}]
)

# Example 2: OpenAI API Embedding Function
class OpenAIEmbedding:
    def __init__(self, model_name: str = "text-embedding-ada-002", api_key: Optional[str] = None):
        self.model_name = model_name
        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')
        if not self.api_key:
            raise ValueError("OpenAI API key required")
        self._openai = OpenAI(api_key=self.api_key)
        self._dimension = 1536 if "ada-002" in model_name else None

    @property
    def dimension(self) -> int:
        if self._dimension is None:
            raise ValueError("Dimension not set for this model")
        return self._dimension

    def __call__(self, input: Union[str, List[str]]) -> List[List[float]]:
        if isinstance(input, str):
            input = [input]
        if not input:
            return []
        # The legacy openai.Embedding.create call was removed in openai 1.0
        response = self._openai.embeddings.create(model=self.model_name, input=input)
        return [item.embedding for item in response.data]

# Use OpenAI embedding
ef_openai = OpenAIEmbedding(model_name='text-embedding-ada-002')
collection_openai = client.create_collection(
    name="openai_docs",
    configuration=pyseekdb.HNSWConfiguration(dimension=1536, distance='cosine'),
    embedding_function=ef_openai
)

# Query with custom embedding function
results = collection.query(
    query_texts=["machine learning models"],
    n_results=5
)
print(f"Found {len(results['ids'][0])} relevant papers")
```

## Collection Information and Inspection

Access collection metadata, preview data, and inspect collection properties.
```python
import pyseekdb

client = pyseekdb.Client(database="analytics")
collection = client.get_collection("documents")

# Get item count
count = collection.count()
print(f"Collection contains {count} documents")

# Get collection properties
print(f"Name: {collection.name}")
print(f"ID: {collection.id}")
print(f"Dimension: {collection.dimension}")
print(f"Distance metric: {collection.distance}")
print(f"Has embedding function: {collection.embedding_function is not None}")
print(f"Metadata: {collection.metadata}")

# Peek at first few items (returns all fields by default)
preview = collection.peek(limit=3)
for i in range(len(preview['ids'])):
    print(f"\nItem {i+1}:")
    print(f"  ID: {preview['ids'][i]}")
    print(f"  Document: {preview['documents'][i]}")
    print(f"  Metadata: {preview['metadatas'][i]}")
    print(f"  Embedding: {preview['embeddings'][i][:5]}... (dim={len(preview['embeddings'][i])})")

# Get detailed collection information
info = collection.describe()
print(f"\nCollection Info:")
print(f"  Name: {info['name']}")
print(f"  Dimension: {info['dimension']}")
print(f"  Count: {info.get('count', 'N/A')}")

# Count collections in database
total_collections = client.count_collection()
print(f"\nDatabase has {total_collections} collections")

# List all collections with details
all_collections = client.list_collections()
print("\nAll collections:")
for col in all_collections:
    print(f"  - {col.name}: {col.dimension}D, {col.distance} distance")
    if col.embedding_function:
        print(f"    Embedding: {col.embedding_function}")

# Check if collection exists before operations
if client.has_collection("documents"):
    col = client.get_collection("documents")
    data = col.get(limit=5)
    print(f"\nFound collection with {len(data['ids'])} sample items")
else:
    print("\nCollection does not exist")

# Get collection client reference
print(f"\nClient mode: {collection.client.mode}")
print(f"Client database: {collection.client.database}")
```

pyseekdb provides a production-ready vector database client that simplifies AI application development through intuitive APIs and flexible deployment options. The library is ideal for building RAG systems where documents need to be semantically searchable, knowledge bases that combine keyword and semantic search, recommendation engines powered by vector similarity, and document classification systems using embedding-based retrieval. Its automatic embedding generation reduces boilerplate code while maintaining the flexibility to use custom embedding models for specialized domains.

The unified interface across embedded, SeekDB server, and OceanBase deployments enables seamless migration from development to production without code changes. Whether prototyping locally with embedded mode, deploying to dedicated vector databases with SeekDB server, or integrating with enterprise OceanBase clusters for multi-tenant isolation, pyseekdb provides consistent APIs with comprehensive error handling. The library's hybrid search capabilities combine traditional full-text search with semantic vector search, making it particularly effective for complex information retrieval scenarios where both keyword matching and conceptual similarity matter.
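The hybrid search examples pass `rank={"rrf": {...}}` with a `rank_constant`, which refers to Reciprocal Rank Fusion. As an illustration of the idea only, the sketch below applies the textbook RRF formula, `score(d) = sum(1 / (rank_constant + rank))` over every list in which `d` appears, to two hand-written rankings. It is not taken from the SeekDB implementation, which may differ in details such as windowing or tie-breaking; `reciprocal_rank_fusion` is a hypothetical name and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    rank_constant: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists with the textbook RRF formula (illustrative)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up rankings standing in for the full-text and vector result lists
full_text_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists (`doc_a`, `doc_c`) rise above documents that appear in only one, which is why rank fusion suits queries where keyword matching and semantic similarity both matter. A larger `rank_constant` flattens the difference between adjacent ranks.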