# OceanBase seekdb

## Introduction

OceanBase seekdb is an AI-native search database that unifies relational, vector, text, JSON, and GIS data in a single engine. Built on the proven OceanBase architecture, it enables hybrid search combining vector similarity, full-text search, and traditional SQL queries within a single statement. The database provides full ACID compliance, MySQL compatibility, and supports both embedded mode for edge devices and standalone server mode for production deployments.

seekdb is designed for modern AI applications requiring semantic search, RAG workflows, and multi-modal data processing. It features built-in embedding functions, in-database AI operations, and seamless integration with popular frameworks like LangChain, LlamaIndex, and HuggingFace. With support for HNSW vector indexing, full-text search with the IK parser, and hybrid search capabilities, seekdb eliminates the need for multiple specialized databases while maintaining high performance and developer-friendly APIs in both Python and C++.

## APIs and Key Functions

### Python SDK: Client Connection (Embedded Mode)

Create a local embedded database instance for edge computing, development, or single-node deployments without requiring a separate server process.

```python
import pyseekdb

# Embedded mode - runs database locally in the same process
client = pyseekdb.Client(
    path="./seekdb.db",  # Local database file path
    database="test"      # Database name
)

# The client is now ready to create collections and perform operations
# Embedded mode is ideal for:
# - Development and testing
# - Edge devices and IoT applications
# - Single-user applications
# - Scenarios requiring no network overhead
```

### Python SDK: Client Connection (Server Mode)

Connect to a remote seekdb server instance for multi-user applications, distributed deployments, or production environments.

```python
import pyseekdb

# Server mode - connects to remote seekdb server
client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    database="test",
    user="root",
    password=""
)

# Alternative: OceanBase mode with tenant support
client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    tenant="test",   # OceanBase tenant name
    database="test",
    user="root",
    password=""
)

# Server mode supports:
# - Multi-user concurrent access
# - Remote database connections
# - Production workloads
# - Tenant isolation in OceanBase deployments
```

### Python SDK: Collection Creation with Automatic Embeddings

Create a collection (similar to a table) with automatic embedding generation using built-in embedding functions for semantic search.

```python
from pyseekdb import DefaultEmbeddingFunction

# Create collection with default embedding function (384 dimensions)
collection = client.create_collection(
    name="my_collection",
    embedding_function=DefaultEmbeddingFunction()
)

print(f"Collection dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# The embedding function automatically converts text to vectors
# No need to manually generate embeddings for documents or queries
# Supports various embedding models through configuration
```
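Because the embedded database persists to the local path, a collection created in one run can typically be reopened in a later run rather than recreated. The sketch below assumes pyseekdb follows a Chroma-style client API, so `get_collection` and its failure behavior are assumptions rather than calls shown elsewhere in this document.

```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction

client = pyseekdb.Client(path="./seekdb.db", database="test")

# Hypothetical: reopen an existing collection, falling back to creation.
# `get_collection` is assumed from the Chroma-style API; verify it against
# the pyseekdb reference before relying on it.
try:
    collection = client.get_collection(
        name="my_collection",
        embedding_function=DefaultEmbeddingFunction()
    )
except Exception:
    collection = client.create_collection(
        name="my_collection",
        embedding_function=DefaultEmbeddingFunction()
    )
```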
### Python SDK: Adding Documents with Auto-Generated Embeddings

Insert documents into a collection with automatic embedding generation, metadata storage, and semantic indexing.

```python
# Define documents and metadata
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
metadatas = [
    {"category": "AI", "index": 0},
    {"category": "Programming", "index": 1},
    {"category": "Database", "index": 2},
    {"category": "AI", "index": 3},
    {"category": "NLP", "index": 4}
]

# Add documents - embeddings auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas
)

print(f"Added {len(documents)} documents with auto-generated embeddings")
# Documents are now searchable using semantic similarity
```

### Python SDK: Semantic Query with Auto-Embedding

Perform semantic search using natural language queries with automatic query embedding and similarity ranking.

```python
# Query using natural language - no manual embedding needed
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Text query auto-converted to vector
    n_results=3              # Return top 3 most similar documents
)

# Process and display results
print(f"Query: '{query_text}'")
print(f"Found {len(results['ids'][0])} results")
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    print(f"  Document: {results['documents'][0][i]}")
    print(f"  Metadata: {results['metadatas'][0][i]}")

# Results are ranked by semantic similarity (lower distance = more similar)
```

### Python SDK: Collection Management

Delete collections to clean up resources and remove indexed data from the database.

```python
# Delete a collection and all its data
client.delete_collection("my_collection")
print("Collection deleted successfully")

# This removes:
# - All documents and embeddings
# - Associated indexes
# - Metadata
# Use with caution - operation is irreversible
```

### SQL: Vector Search Table Creation

Create tables with vector columns, vector indexes, and full-text search capabilities using standard SQL DDL.

```sql
-- Create table with vector and full-text search support
CREATE TABLE articles (
    id INT PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding VECTOR(384),
    FULLTEXT INDEX idx_fts(content) WITH PARSER ik,
    VECTOR INDEX idx_vec (embedding) WITH(
        DISTANCE=l2,  -- L2 (Euclidean) distance metric
        TYPE=hnsw,    -- HNSW algorithm for ANN search
        LIB=vsag      -- VSAG library for vector operations
    )
) ORGANIZATION = HEAP;

-- VECTOR(384): Vector column with 384 dimensions
-- FULLTEXT INDEX: Full-text search with IK tokenizer
-- VECTOR INDEX: Approximate nearest neighbor search
-- ORGANIZATION = HEAP: Optimized for read/write performance
```
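The INSERT examples in the next section assume embeddings that were computed outside the database and serialized as bracketed literals matching the 384-dimension column. A minimal sketch of producing such a literal with the SDK's `DefaultEmbeddingFunction` and inserting it over the MySQL protocol follows; treating the embedding function as callable on a list of strings, and the use of mysql-connector-python as the driver, are assumptions rather than behavior documented above.

```python
import mysql.connector  # any MySQL-compatible driver; this particular one is an assumption
from pyseekdb import DefaultEmbeddingFunction

ef = DefaultEmbeddingFunction()

def to_vector_literal(text):
    # Assumption: the embedding function is callable on a list of strings and
    # returns one 384-dimension vector per input (Chroma-style convention).
    vec = ef([text])[0]
    return "[" + ", ".join(f"{float(x):.6f}" for x in vec) + "]"

conn = mysql.connector.connect(host="127.0.0.1", port=2881,
                               user="root", password="", database="test")
cur = conn.cursor()
content = "Artificial intelligence is transforming industries"
cur.execute(
    "INSERT INTO articles (id, title, content, embedding) VALUES (%s, %s, %s, %s)",
    (1, "AI and Machine Learning", content, to_vector_literal(content)),
)
conn.commit()
```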
### SQL: Insert Documents with Embeddings

Insert documents with pre-computed embeddings for vector search and full-text indexing.

```sql
-- Insert documents with vector embeddings
-- Note: Embeddings should be pre-computed using your embedding model
INSERT INTO articles (id, title, content, embedding) VALUES
(1, 'AI and Machine Learning',
 'Artificial intelligence is transforming industries with machine learning capabilities',
 '[0.123, 0.456, 0.789, ...]'),  -- Replace with actual 384-dim vector
(2, 'Database Systems',
 'Modern databases provide high performance and scalability for data-intensive applications',
 '[0.234, 0.567, 0.890, ...]'),
(3, 'Vector Search',
 'Vector databases enable semantic search by storing embeddings and computing similarity',
 '[0.345, 0.678, 0.901, ...]');

-- Embeddings must match the dimension specified in the table schema
-- Both vector and full-text indexes are automatically updated
```

### SQL: Hybrid Search Query

Combine vector similarity search with full-text keyword matching in a single SQL query for powerful hybrid search.

```sql
-- Hybrid search: Vector similarity + Full-text relevance
SELECT
    id,
    title,
    content,
    l2_distance(embedding, '[0.123, 0.456, 0.789, ...]') AS vector_distance,
    MATCH(content) AGAINST('machine learning AI' IN NATURAL LANGUAGE MODE) AS text_score
FROM articles
WHERE MATCH(content) AGAINST('machine learning AI' IN NATURAL LANGUAGE MODE)
ORDER BY vector_distance
APPROXIMATE LIMIT 10;

-- Explanation:
-- 1. l2_distance(): Calculates L2 distance between stored and query vectors
-- 2. MATCH...AGAINST: Full-text search with relevance scoring
-- 3. WHERE clause: Filters by keyword relevance
-- 4. ORDER BY...APPROXIMATE: Approximate nearest neighbor ranking
-- 5. Results combine semantic similarity and keyword relevance
```

### Python Embedded: Hybrid Search with JSON Query

Execute hybrid search combining vector and full-text queries using Elasticsearch-style JSON syntax through the Python embedded library.

```python
import pylibseekdb as seekdb

# Initialize embedded database
seekdb.open()
conn = seekdb.connect("test")
cursor = conn.cursor()

# Create table with vector and full-text indexes
cursor.execute('''
CREATE TABLE doc_table(
    c1 INT,
    vector VECTOR(3),
    query VARCHAR(255),
    content VARCHAR(255),
    VECTOR INDEX idx1(vector) WITH(DISTANCE=l2, TYPE=hnsw, LIB=vsag),
    FULLTEXT idx2(query),
    FULLTEXT idx3(content)
)
''')

# Insert sample documents
cursor.execute('''
INSERT INTO doc_table VALUES
(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database")
''')
conn.commit()

# Hybrid search using Elasticsearch-style JSON syntax
cursor.execute('''
SET @parm = '{
    "query": {
        "bool": {
            "should": [
                {"match": {"query": "hi hello"}},
                {"match": {"content": "oceanbase mysql"}}
            ]
        }
    },
    "knn": {
        "field": "vector",
        "k": 5,
        "query_vector": [1,2,3]
    },
    "_source": ["query", "content", "_keyword_score", "_semantic_score"]
}'
''')
conn.commit()

# Execute hybrid search
cursor.execute('''
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm))
''')
results = cursor.fetchall()
print(results)
# Returns JSON with documents ranked by combined keyword and semantic scores
```
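The SEARCH call returns its result as a single JSON document in one row and one column, so it can be decoded with the standard library after fetching. This sketch only relies on the calls already shown plus `json` from the Python standard library; the internal schema of the returned JSON is not documented here, so it only previews the payload rather than drilling into specific fields.

```python
import json

# Re-run the hybrid search and decode the JSON result with the standard library.
cursor.execute('''
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm))
''')
rows = cursor.fetchall()

# One row, one column holding the JSON text; the payload's schema is not
# specified above, so decode it and inspect before relying on specific fields.
payload = json.loads(rows[0][0])
print(type(payload))
print(json.dumps(payload, indent=2)[:500])  # preview the start of the result
```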
### C++ API: Table Operations with ObTable Interface

Perform CRUD operations using the C++ ObTable API for high-performance direct table access.

```cpp
#include "libobtable.h"

using namespace oceanbase::table;
using namespace oceanbase::common;

int main()
{
    int ret = OB_SUCCESS;

    // 1. Initialize library
    ret = ObTableServiceLibrary::init();

    // 2. Initialize client
    ObTableServiceClient* p_service_client = ObTableServiceClient::alloc_client();
    ObTableServiceClient &service_client = *p_service_client;
    ret = service_client.init(
        ObString::make_string("127.0.0.1"),  // host
        2881,                                // port
        2881,                                // rpc_port
        ObString::make_string("sys"),        // tenant
        ObString::make_string("root"),       // user
        ObString::make_string(""),           // password
        ObString::make_string("test"),       // database
        ObString::make_string("")            // cluster
    );

    // 3. Allocate table instance
    ObTable* table = NULL;
    ret = service_client.alloc_table(ObString::make_string("t2"), table);

    // 4. INSERT operation
    ObObj key_objs[3];
    key_objs[0].set_varbinary("abc");
    key_objs[1].set_varchar("cq");
    key_objs[1].set_collation_type(CS_TYPE_UTF8MB4_GENERAL_CI);
    key_objs[2].set_int(1);
    ObRowkey rk(key_objs, 3);

    ObTableEntity entity;
    entity.set_rowkey(rk);
    ObObj value;
    value.set_varchar("value1");
    value.set_collation_type(CS_TYPE_UTF8MB4_GENERAL_CI);
    entity.set_property("v1", value);
    value.set_int(123);
    entity.set_property("v2", value);

    ObTableOperation table_op = ObTableOperation::insert(entity);
    ObTableOperationResult result;
    ret = table->execute(table_op, result);

    // 5. GET operation
    ObTableEntity entity_get;
    entity_get.set_rowkey(rk);
    ObObj null_obj;
    entity_get.set_property(ObString::make_string("v1"), null_obj);
    entity_get.set_property(ObString::make_string("v2"), null_obj);
    ObTableOperation get_op = ObTableOperation::retrieve(entity_get);
    ObTableOperationResult result_get;
    ret = table->execute(get_op, result_get);

    // 6. UPDATE operation
    entity.reset();
    entity.set_rowkey(rk);
    value.set_int(666);
    entity.set_property("v2", value);
    table_op = ObTableOperation::update(entity);
    ret = table->execute(table_op, result);

    // 7. DELETE operation
    table_op = ObTableOperation::del(entity);
    ret = table->execute(table_op, result);

    // 8. Cleanup
    service_client.free_table(table);
    service_client.destroy();
    ObTableServiceClient::free_client(p_service_client);
    ObTableServiceLibrary::destroy();
    return ret;
}

// Supports: insert, update, replace, insert_or_update, delete, retrieve
// High-performance direct access bypassing SQL layer
```
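The ObTable walkthrough above operates on an existing table named `t2` with a three-part rowkey (binary, string, integer) and two value columns `v1` and `v2`, but the document does not show its schema. One plausible MySQL-compatible DDL matching that access pattern is sketched below; the key column names, types, and lengths are assumptions for illustration only.

```sql
-- Hypothetical schema for the t2 table used by the ObTable example above.
-- The rowkey maps to the primary key columns in order; v1/v2 are value columns.
CREATE TABLE t2 (
    k1 VARBINARY(1024),
    k2 VARCHAR(64),
    k3 BIGINT,
    v1 VARCHAR(1024),
    v2 BIGINT,
    PRIMARY KEY (k1, k2, k3)
);
```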
### C++ API: Key-Value Store with ObPStore

Use the ObPStore interface for HBase-style column-family key-value operations with versioning support.

```cpp
#include "libobtable.h"

using namespace oceanbase::table;
using namespace oceanbase::common;

int main()
{
    int ret = OB_SUCCESS;

    // Initialize library and client (same as ObTable example)
    ret = ObTableServiceLibrary::init();
    ObTableServiceClient* p_service_client = ObTableServiceClient::alloc_client();
    ObTableServiceClient &service_client = *p_service_client;
    ret = service_client.init(
        ObString::make_string("127.0.0.1"), 2881, 2881,
        ObString::make_string("sys"), ObString::make_string("root"),
        ObString::make_string(""), ObString::make_string("test"),
        ObString::make_string("")
    );

    // Initialize PStore
    ObPStore pstore;
    ret = pstore.init(service_client);

    // PUT operation (write key-value with version)
    ObHKVTable::Key key;
    ObHKVTable::Value value;
    key.rowkey_ = ObString::make_string("abc");
    key.column_qualifier_ = ObString::make_string("cq");
    key.version_ = 123;  // Timestamp/version
    value.set_varchar(ObString::make_string("value1"));
    ret = pstore.put(
        ObString::make_string("t4"),   // table name
        ObString::make_string("cf1"),  // column family
        key, value
    );

    // GET operation
    ObHKVTable::Value value_out;
    ret = pstore.get(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        key, value_out
    );

    // REMOVE operation
    ret = pstore.remove(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        key
    );

    // MULTI-PUT (batch operation)
    ObHKVTable::Keys keys;
    ObHKVTable::Values values;
    for (int64_t i = 0; i < 16; ++i) {
        key.rowkey_ = ObString::make_string("abc");
        key.column_qualifier_ = ObString::make_string("cq");
        key.version_ = 123 + i;
        keys.push_back(key);
        value.set_varchar(ObString::make_string("value1"));
        values.push_back(value);
    }
    ret = pstore.multi_put(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys, values
    );

    // MULTI-GET (batch retrieval)
    ObHKVTable::Values values_out;
    ret = pstore.multi_get(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys, values_out
    );

    // MULTI-REMOVE (batch deletion)
    ret = pstore.multi_remove(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys
    );

    // Cleanup
    service_client.destroy();
    ObTableServiceClient::free_client(p_service_client);
    ObTableServiceLibrary::destroy();
    return 0;
}

// PStore provides an HBase-compatible API with:
// - Row key, column qualifier, and version (timestamp)
// - Batch operations for high throughput
// - Column family organization
```

### Docker Deployment

Deploy seekdb in containerized environments for quick testing, development, or production workloads.

```bash
# Run seekdb in Docker with persistent storage
docker run -d \
  --name seekdb \
  -p 2881:2881 \
  -v ./data:/var/lib/oceanbase/store \
  oceanbase/seekdb:latest

# Container exposes:
# - Port 2881: MySQL protocol and seekdb API
# - Volume mount: Persistent data storage

# Connect using a MySQL client
mysql -h 127.0.0.1 -P 2881 -u root -p test
```

Or connect using the Python SDK:

```python
import pyseekdb

client = pyseekdb.Client(host="127.0.0.1", port=2881, database="test", user="root", password="")
```
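The container takes a moment to start before the MySQL port accepts connections, and the document does not specify a readiness signal. A minimal polling sketch against port 2881, assuming a locally installed mysql client and the empty root password used above, is one way to wait:

```bash
# Minimal readiness check: poll until the containerized server accepts
# MySQL connections on port 2881 (root with an empty password, as above).
until mysql -h 127.0.0.1 -P 2881 -u root -e "SELECT 1" test >/dev/null 2>&1; do
  echo "waiting for seekdb to become ready..."
  sleep 2
done
echo "seekdb is ready"
```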
### Binary Installation and Build

Install seekdb from RPM packages or build from source for production deployments on Linux systems.

```bash
# Install from RPM (production)
rpm -ivh seekdb-1.0.0.0-xxxxxxx.el8.x86_64.rpm

# Or build from source (development)
git clone https://github.com/oceanbase/seekdb.git
cd seekdb

# Build in debug mode with all dependencies
bash build.sh debug --init --make

# Set up the runtime directory
mkdir -p ~/seekdb/bin
cp build_debug/src/observer/observer ~/seekdb/bin
cd ~/seekdb

# Start the server
./bin/observer

# Server starts on default port 2881
# Ready to accept connections from MySQL clients or the SDK
# Configuration: Use a fresh directory for testing
# Logs: Located in ~/seekdb/log/
```

## Summary and Integration Patterns

seekdb serves as a unified data platform for AI-native applications requiring semantic search, full-text retrieval, and traditional database operations. Primary use cases include RAG systems for enterprise knowledge bases, semantic search engines for e-commerce and content platforms, agentic AI applications with memory and context management, and edge AI deployments on resource-constrained devices. The database excels in scenarios where multiple data modalities (structured, unstructured, vector) must be queried together, eliminating the complexity of maintaining separate specialized databases for each data type.

Integration patterns leverage seekdb's dual-interface architecture: the Python SDK for rapid AI application development with frameworks like LangChain and LlamaIndex, and the SQL interface for enterprise application integration using standard database drivers. Embedded mode enables offline AI applications on edge devices, mobile platforms, and development environments, while server mode provides multi-tenant, high-concurrency access for production workloads. The C++ APIs offer low-level control for performance-critical applications. seekdb's MySQL compatibility eases migration from existing MySQL-based systems, while built-in embedding functions and hybrid search remove the need for external vector processing pipelines, enabling document-in/data-out RAG workflows entirely within the database layer.