# OceanBase seekdb

## Introduction

OceanBase seekdb is an AI-native search database that unifies relational, vector, text, JSON, and GIS data in a single engine. Built on the proven OceanBase architecture, it enables hybrid search combining vector similarity, full-text search, and traditional SQL queries within a single statement. The database provides full ACID compliance, MySQL compatibility, and supports both embedded mode for edge devices and standalone server mode for production deployments.

seekdb is designed for modern AI applications requiring semantic search, RAG workflows, and multi-modal data processing. It features built-in embedding functions, in-database AI operations, and seamless integration with popular frameworks like LangChain, LlamaIndex, and HuggingFace. With support for HNSW vector indexing, full-text search with the IK parser, and hybrid search capabilities, seekdb eliminates the need for multiple specialized databases while maintaining high performance and developer-friendly APIs in both Python and C++.

## APIs and Key Functions

### Python SDK: Client Connection (Embedded Mode)

Create a local embedded database instance for edge computing, development, or single-node deployments without requiring a separate server process.

```python
import pyseekdb

# Embedded mode - runs database locally in the same process
client = pyseekdb.Client(
    path="./seekdb.db",  # Local database file path
    database="test"      # Database name
)

# The client is now ready to create collections and perform operations
# Embedded mode is ideal for:
# - Development and testing
# - Edge devices and IoT applications
# - Single-user applications
# - Scenarios requiring no network overhead
```

### Python SDK: Client Connection (Server Mode)

Connect to a remote seekdb server instance for multi-user applications, distributed deployments, or production environments.

```python
import pyseekdb

# Server mode - connects to remote seekdb server
client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    database="test",
    user="root",
    password=""
)

# Alternative: OceanBase mode with tenant support
client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    tenant="test",   # OceanBase tenant name
    database="test",
    user="root",
    password=""
)

# Server mode supports:
# - Multi-user concurrent access
# - Remote database connections
# - Production workloads
# - Tenant isolation in OceanBase deployments
```

### Python SDK: Collection Creation with Automatic Embeddings

Create a collection (similar to a table) with automatic embedding generation using built-in embedding functions for semantic search.

```python
from pyseekdb import DefaultEmbeddingFunction

# Create collection with default embedding function (384 dimensions)
collection = client.create_collection(
    name="my_collection",
    embedding_function=DefaultEmbeddingFunction()
)

print(f"Collection dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# The embedding function automatically converts text to vectors
# No need to manually generate embeddings for documents or queries
# Supports various embedding models through configuration
```
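Because the embedded database persists to the local path, a collection created in one run can typically be reopened in a later run rather than recreated. The sketch below assumes pyseekdb follows a Chroma-style client API, so `get_collection` and its failure behavior are assumptions rather than calls shown elsewhere in this document.

```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction

client = pyseekdb.Client(path="./seekdb.db", database="test")

# Hypothetical: reopen an existing collection, falling back to creation.
# `get_collection` is assumed from the Chroma-style API; verify it against
# the pyseekdb reference before relying on it.
try:
    collection = client.get_collection(
        name="my_collection",
        embedding_function=DefaultEmbeddingFunction()
    )
except Exception:
    collection = client.create_collection(
        name="my_collection",
        embedding_function=DefaultEmbeddingFunction()
    )
```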
### Python SDK: Adding Documents with Auto-Generated Embeddings

Insert documents into a collection with automatic embedding generation, metadata storage, and semantic indexing.

```python
# Define documents and metadata
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
metadatas = [
    {"category": "AI", "index": 0},
    {"category": "Programming", "index": 1},
    {"category": "Database", "index": 2},
    {"category": "AI", "index": 3},
    {"category": "NLP", "index": 4}
]

# Add documents - embeddings auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas
)

print(f"Added {len(documents)} documents with auto-generated embeddings")
# Documents are now searchable using semantic similarity
```

### Python SDK: Semantic Query with Auto-Embedding

Perform semantic search using natural language queries with automatic query embedding and similarity ranking.

```python
# Query using natural language - no manual embedding needed
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Text query auto-converted to vector
    n_results=3              # Return top 3 most similar documents
)

# Process and display results
print(f"Query: '{query_text}'")
print(f"Found {len(results['ids'][0])} results")
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    print(f"  Document: {results['documents'][0][i]}")
    print(f"  Metadata: {results['metadatas'][0][i]}")

# Results are ranked by semantic similarity (lower distance = more similar)
```

### Python SDK: Collection Management

Delete collections to clean up resources and remove indexed data from the database.

```python
# Delete a collection and all its data
client.delete_collection("my_collection")
print("Collection deleted successfully")

# This removes:
# - All documents and embeddings
# - Associated indexes
# - Metadata
# Use with caution - operation is irreversible
```

### SQL: Vector Search Table Creation

Create tables with vector columns, vector indexes, and full-text search capabilities using standard SQL DDL.

```sql
-- Create table with vector and full-text search support
CREATE TABLE articles (
    id INT PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding VECTOR(384),
    FULLTEXT INDEX idx_fts(content) WITH PARSER ik,
    VECTOR INDEX idx_vec (embedding) WITH(
        DISTANCE=l2,  -- L2 (Euclidean) distance metric
        TYPE=hnsw,    -- HNSW algorithm for ANN search
        LIB=vsag      -- VSAG library for vector operations
    )
) ORGANIZATION = HEAP;

-- VECTOR(384): Vector column with 384 dimensions
-- FULLTEXT INDEX: Full-text search with IK tokenizer
-- VECTOR INDEX: Approximate nearest neighbor search
-- ORGANIZATION = HEAP: Optimized for read/write performance
```
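The INSERT examples in the next section assume embeddings that were computed outside the database and serialized as bracketed literals matching the 384-dimension column. A minimal sketch of producing such a literal with the SDK's `DefaultEmbeddingFunction` and inserting it over the MySQL protocol follows; treating the embedding function as callable on a list of strings, and the use of mysql-connector-python as the driver, are assumptions rather than behavior documented above.

```python
import mysql.connector  # any MySQL-compatible driver; this particular one is an assumption
from pyseekdb import DefaultEmbeddingFunction

ef = DefaultEmbeddingFunction()

def to_vector_literal(text):
    # Assumption: the embedding function is callable on a list of strings and
    # returns one 384-dimension vector per input (Chroma-style convention).
    vec = ef([text])[0]
    return "[" + ", ".join(f"{float(x):.6f}" for x in vec) + "]"

conn = mysql.connector.connect(host="127.0.0.1", port=2881,
                               user="root", password="", database="test")
cur = conn.cursor()
content = "Artificial intelligence is transforming industries"
cur.execute(
    "INSERT INTO articles (id, title, content, embedding) VALUES (%s, %s, %s, %s)",
    (1, "AI and Machine Learning", content, to_vector_literal(content)),
)
conn.commit()
```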
### SQL: Insert Documents with Embeddings

Insert documents with pre-computed embeddings for vector search and full-text indexing.

```sql
-- Insert documents with vector embeddings
-- Note: Embeddings should be pre-computed using your embedding model
INSERT INTO articles (id, title, content, embedding) VALUES
(1, 'AI and Machine Learning',
 'Artificial intelligence is transforming industries with machine learning capabilities',
 '[0.123, 0.456, 0.789, ...]'),  -- Replace with actual 384-dim vector
(2, 'Database Systems',
 'Modern databases provide high performance and scalability for data-intensive applications',
 '[0.234, 0.567, 0.890, ...]'),
(3, 'Vector Search',
 'Vector databases enable semantic search by storing embeddings and computing similarity',
 '[0.345, 0.678, 0.901, ...]');

-- Embeddings must match the dimension specified in the table schema
-- Both vector and full-text indexes are automatically updated
```

### SQL: Hybrid Search Query

Combine vector similarity search with full-text keyword matching in a single SQL query for powerful hybrid search.

```sql
-- Hybrid search: Vector similarity + Full-text relevance
SELECT
    id,
    title,
    content,
    l2_distance(embedding, '[0.123, 0.456, 0.789, ...]') AS vector_distance,
    MATCH(content) AGAINST('machine learning AI' IN NATURAL LANGUAGE MODE) AS text_score
FROM articles
WHERE MATCH(content) AGAINST('machine learning AI' IN NATURAL LANGUAGE MODE)
ORDER BY vector_distance
APPROXIMATE LIMIT 10;

-- Explanation:
-- 1. l2_distance(): Calculates L2 distance between stored and query vectors
-- 2. MATCH...AGAINST: Full-text search with relevance scoring
-- 3. WHERE clause: Filters by keyword relevance
-- 4. ORDER BY...APPROXIMATE: Approximate nearest neighbor ranking
-- 5. Results combine semantic similarity and keyword relevance
```

### Python Embedded: Hybrid Search with JSON Query

Execute hybrid search combining vector and full-text queries using Elasticsearch-style JSON syntax through the Python embedded library.

```python
import pylibseekdb as seekdb

# Initialize embedded database
seekdb.open()
conn = seekdb.connect("test")
cursor = conn.cursor()

# Create table with vector and full-text indexes
cursor.execute('''
CREATE TABLE doc_table(
    c1 INT,
    vector VECTOR(3),
    query VARCHAR(255),
    content VARCHAR(255),
    VECTOR INDEX idx1(vector) WITH(DISTANCE=l2, TYPE=hnsw, LIB=vsag),
    FULLTEXT idx2(query),
    FULLTEXT idx3(content)
)
''')

# Insert sample documents
cursor.execute('''
INSERT INTO doc_table VALUES
(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database")
''')
conn.commit()

# Hybrid search using Elasticsearch-style JSON syntax
cursor.execute('''
SET @parm = '{
    "query": {
        "bool": {
            "should": [
                {"match": {"query": "hi hello"}},
                {"match": {"content": "oceanbase mysql"}}
            ]
        }
    },
    "knn": {
        "field": "vector",
        "k": 5,
        "query_vector": [1,2,3]
    },
    "_source": ["query", "content", "_keyword_score", "_semantic_score"]
}'
''')
conn.commit()

# Execute hybrid search
cursor.execute('''
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm))
''')
results = cursor.fetchall()
print(results)
# Returns JSON with documents ranked by combined keyword and semantic scores
```
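The SEARCH call returns its result as a single JSON document in one row and one column, so it can be decoded with the standard library after fetching. This sketch only relies on the calls already shown plus `json` from the Python standard library; the internal schema of the returned JSON is not documented here, so it only previews the payload rather than drilling into specific fields.

```python
import json

# Re-run the hybrid search and decode the JSON result with the standard library.
cursor.execute('''
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm))
''')
rows = cursor.fetchall()

# One row, one column holding the JSON text; the payload's schema is not
# specified above, so decode it and inspect before relying on specific fields.
payload = json.loads(rows[0][0])
print(type(payload))
print(json.dumps(payload, indent=2)[:500])  # preview the start of the result
```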
### C++ API: Table Operations with ObTable Interface

Perform CRUD operations using the C++ ObTable API for high-performance direct table access.

```cpp
#include "libobtable.h"

using namespace oceanbase::table;
using namespace oceanbase::common;

int main()
{
    int ret = OB_SUCCESS;

    // 1. Initialize library
    ret = ObTableServiceLibrary::init();

    // 2. Initialize client
    ObTableServiceClient* p_service_client = ObTableServiceClient::alloc_client();
    ObTableServiceClient &service_client = *p_service_client;
    ret = service_client.init(
        ObString::make_string("127.0.0.1"),  // host
        2881,                                // port
        2881,                                // rpc_port
        ObString::make_string("sys"),        // tenant
        ObString::make_string("root"),       // user
        ObString::make_string(""),           // password
        ObString::make_string("test"),       // database
        ObString::make_string("")            // cluster
    );

    // 3. Allocate table instance
    ObTable* table = NULL;
    ret = service_client.alloc_table(ObString::make_string("t2"), table);

    // 4. INSERT operation
    ObObj key_objs[3];
    key_objs[0].set_varbinary("abc");
    key_objs[1].set_varchar("cq");
    key_objs[1].set_collation_type(CS_TYPE_UTF8MB4_GENERAL_CI);
    key_objs[2].set_int(1);
    ObRowkey rk(key_objs, 3);

    ObTableEntity entity;
    entity.set_rowkey(rk);
    ObObj value;
    value.set_varchar("value1");
    value.set_collation_type(CS_TYPE_UTF8MB4_GENERAL_CI);
    entity.set_property("v1", value);
    value.set_int(123);
    entity.set_property("v2", value);

    ObTableOperation table_op = ObTableOperation::insert(entity);
    ObTableOperationResult result;
    ret = table->execute(table_op, result);

    // 5. GET operation
    ObTableEntity entity_get;
    entity_get.set_rowkey(rk);
    ObObj null_obj;
    entity_get.set_property(ObString::make_string("v1"), null_obj);
    entity_get.set_property(ObString::make_string("v2"), null_obj);
    ObTableOperation get_op = ObTableOperation::retrieve(entity_get);
    ObTableOperationResult result_get;
    ret = table->execute(get_op, result_get);

    // 6. UPDATE operation
    entity.reset();
    entity.set_rowkey(rk);
    value.set_int(666);
    entity.set_property("v2", value);
    table_op = ObTableOperation::update(entity);
    ret = table->execute(table_op, result);

    // 7. DELETE operation
    table_op = ObTableOperation::del(entity);
    ret = table->execute(table_op, result);

    // 8. Cleanup
    service_client.free_table(table);
    service_client.destroy();
    ObTableServiceClient::free_client(p_service_client);
    ObTableServiceLibrary::destroy();
    return ret;
}

// Supports: insert, update, replace, insert_or_update, delete, retrieve
// High-performance direct access bypassing SQL layer
```
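The ObTable walkthrough above operates on an existing table named `t2` with a three-part rowkey (binary, string, integer) and two value columns `v1` and `v2`, but the document does not show its schema. One plausible MySQL-compatible DDL matching that access pattern is sketched below; the key column names, types, and lengths are assumptions for illustration only.

```sql
-- Hypothetical schema for the t2 table used by the ObTable example above.
-- The rowkey maps to the primary key columns in order; v1/v2 are value columns.
CREATE TABLE t2 (
    k1 VARBINARY(1024),
    k2 VARCHAR(64),
    k3 BIGINT,
    v1 VARCHAR(1024),
    v2 BIGINT,
    PRIMARY KEY (k1, k2, k3)
);
```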
### C++ API: Key-Value Store with ObPStore

Use the ObPStore interface for HBase-style column-family key-value operations with versioning support.

```cpp
#include "libobtable.h"

using namespace oceanbase::table;
using namespace oceanbase::common;

int main()
{
    int ret = OB_SUCCESS;

    // Initialize library and client (same as ObTable example)
    ret = ObTableServiceLibrary::init();
    ObTableServiceClient* p_service_client = ObTableServiceClient::alloc_client();
    ObTableServiceClient &service_client = *p_service_client;
    ret = service_client.init(
        ObString::make_string("127.0.0.1"), 2881, 2881,
        ObString::make_string("sys"), ObString::make_string("root"),
        ObString::make_string(""), ObString::make_string("test"),
        ObString::make_string("")
    );

    // Initialize PStore
    ObPStore pstore;
    ret = pstore.init(service_client);

    // PUT operation (write key-value with version)
    ObHKVTable::Key key;
    ObHKVTable::Value value;
    key.rowkey_ = ObString::make_string("abc");
    key.column_qualifier_ = ObString::make_string("cq");
    key.version_ = 123;  // Timestamp/version
    value.set_varchar(ObString::make_string("value1"));
    ret = pstore.put(
        ObString::make_string("t4"),   // table name
        ObString::make_string("cf1"),  // column family
        key, value
    );

    // GET operation
    ObHKVTable::Value value_out;
    ret = pstore.get(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        key, value_out
    );

    // REMOVE operation
    ret = pstore.remove(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        key
    );

    // MULTI-PUT (batch operation)
    ObHKVTable::Keys keys;
    ObHKVTable::Values values;
    for (int64_t i = 0; i < 16; ++i) {
        key.rowkey_ = ObString::make_string("abc");
        key.column_qualifier_ = ObString::make_string("cq");
        key.version_ = 123 + i;
        keys.push_back(key);
        value.set_varchar(ObString::make_string("value1"));
        values.push_back(value);
    }
    ret = pstore.multi_put(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys, values
    );

    // MULTI-GET (batch retrieval)
    ObHKVTable::Values values_out;
    ret = pstore.multi_get(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys, values_out
    );

    // MULTI-REMOVE (batch deletion)
    ret = pstore.multi_remove(
        ObString::make_string("t4"),
        ObString::make_string("cf1"),
        keys
    );

    // Cleanup
    service_client.destroy();
    ObTableServiceClient::free_client(p_service_client);
    ObTableServiceLibrary::destroy();
    return 0;
}

// PStore provides an HBase-compatible API with:
// - Row key, column qualifier, and version (timestamp)
// - Batch operations for high throughput
// - Column family organization
```

### Docker Deployment

Deploy seekdb in containerized environments for quick testing, development, or production workloads.

```bash
# Run seekdb in Docker with persistent storage
docker run -d \
  --name seekdb \
  -p 2881:2881 \
  -v ./data:/var/lib/oceanbase/store \
  oceanbase/seekdb:latest

# Container exposes:
# - Port 2881: MySQL protocol and seekdb API
# - Volume mount: Persistent data storage

# Connect using a MySQL client
mysql -h 127.0.0.1 -P 2881 -u root -p test
```

Or connect using the Python SDK:

```python
import pyseekdb

client = pyseekdb.Client(host="127.0.0.1", port=2881, database="test", user="root", password="")
```
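The container takes a moment to start before the MySQL port accepts connections, and the document does not specify a readiness signal. A minimal polling sketch against port 2881, assuming a locally installed mysql client and the empty root password used above, is one way to wait:

```bash
# Minimal readiness check: poll until the containerized server accepts
# MySQL connections on port 2881 (root with an empty password, as above).
until mysql -h 127.0.0.1 -P 2881 -u root -e "SELECT 1" test >/dev/null 2>&1; do
  echo "waiting for seekdb to become ready..."
  sleep 2
done
echo "seekdb is ready"
```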
### Binary Installation and Build

Install seekdb from RPM packages or build from source for production deployments on Linux systems.

```bash
# Install from RPM (production)
rpm -ivh seekdb-1.0.0.0-xxxxxxx.el8.x86_64.rpm

# Or build from source (development)
git clone https://github.com/oceanbase/seekdb.git
cd seekdb

# Build in debug mode with all dependencies
bash build.sh debug --init --make

# Set up the runtime directory
mkdir -p ~/seekdb/bin
cp build_debug/src/observer/observer ~/seekdb/bin
cd ~/seekdb

# Start the server
./bin/observer

# Server starts on default port 2881
# Ready to accept connections from MySQL clients or the SDK
# Configuration: Use a fresh directory for testing
# Logs: Located in ~/seekdb/log/
```

## Summary and Integration Patterns

seekdb serves as a unified data platform for AI-native applications requiring semantic search, full-text retrieval, and traditional database operations. Primary use cases include RAG systems for enterprise knowledge bases, semantic search engines for e-commerce and content platforms, agentic AI applications with memory and context management, and edge AI deployments on resource-constrained devices. The database excels in scenarios where multiple data modalities (structured, unstructured, vector) must be queried together, eliminating the complexity of maintaining separate specialized databases for each data type.

Integration patterns leverage seekdb's dual-interface architecture: the Python SDK for rapid AI application development with frameworks like LangChain and LlamaIndex, and the SQL interface for enterprise application integration using standard database drivers. Embedded mode enables offline AI applications on edge devices, mobile platforms, and development environments, while server mode provides multi-tenant, high-concurrency access for production workloads. The C++ APIs offer low-level control for performance-critical applications. seekdb's MySQL compatibility eases migration from existing MySQL-based systems, while built-in embedding functions and hybrid search remove the need for external vector processing pipelines, enabling document-in/data-out RAG workflows entirely within the database layer.