### Docs2DB Development Setup and Testing

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Commands for setting up the docs2db development environment, including cloning the repository, installing dependencies, and running tests and pre-commit hooks.

```bash
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install

# Run tests
make test

# Run all checks
pre-commit run --all-files
```

--------------------------------

### Clone and Install Dependencies for Docs2DB

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Clones the Docs2DB repository and installs project dependencies using uv. This is the initial setup step for developers.

```bash
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
```

--------------------------------

### Install Optional WatsonX Dependencies (Bash)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Installs the optional dependencies for WatsonX support. For local development inside the repository, run `uv sync` with the `watsonx` extra. To install the tool itself with WatsonX support, install the `docs2db[watsonx]` package.

```bash
# For development (local repo):
uv sync --extra watsonx

# For tool installation:
uv tool install 'docs2db[watsonx]'
```

--------------------------------

### Install Docs2DB using uv

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Installs the `docs2db` command-line tool with `uv tool install`, making the `docs2db` command available for building RAG databases.

```bash
uv tool install docs2db
```

--------------------------------

### Add Docs2DB as a Project Dependency (TOML)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

Demonstrates different methods for adding Docs2DB as a dependency in a Python project using `pyproject.toml`. Options include local editable installs, direct Git repository links, and standard PyPI package installation.

```toml
[project]
dependencies = [
    "docs2db",
    # ... other dependencies
]

[tool.uv.sources]
docs2db = { path = "../docs2db", editable = true }

```

```toml
[project]
dependencies = [
    "docs2db @ git+https://github.com/rhel-lightspeed/docs2db.git",
]

```

```toml
[project]
dependencies = [
    "docs2db>=0.1.0",
]

```
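After adding the dependency by any of the three methods, it can help to confirm which version actually resolved in the environment. A minimal sketch using only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of docs2db:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the resolved version, or None when docs2db is not installed
print(installed_version("docs2db"))
```

Returning `None` instead of raising keeps the check usable in setup scripts that want to report a missing dependency rather than crash.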

--------------------------------

### Test Package Installation Locally (uv)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

A sequence of commands to test the installation of the built package in an isolated virtual environment. This helps verify that the package builds correctly and installs without issues.

```bash
# Build the wheel
cd /path/to/docs2db
uv build

# Test installation in isolated environment
cd /tmp
mkdir test-install && cd test-install
uv venv
uv pip install /path/to/docs2db/dist/docs2db-*.whl

# Verify CLI works
uv run docs2db --help
uv run docs2db pipeline --help
```

--------------------------------

### Start Docs2DB PostgreSQL Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Starts the PostgreSQL database server in a container, using Podman or Docker. The RAG database is stored in this server. This command is part of the database lifecycle management.

```bash
docs2db db-start
```

--------------------------------

### Set Up and Run Docs2DB with IBM WatsonX

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Sets up and runs docs2db with IBM WatsonX. Export the API key and project ID as environment variables, then run the chunking command with the WatsonX URL and a context model.

```bash
# Set environment variables
export WATSONX_API_KEY="your-api-key-here"
export WATSONX_PROJECT_ID="your-project-id-here"

# Run chunking with WatsonX
uv run docs2db chunk \
  --watsonx-url "https://us-south.ml.cloud.ibm.com" \
  --context-model "ibm/granite-13b-chat-v2"
```
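Because the chunking command depends on both environment variables, a small preflight check can catch a missing one before the run starts. The sketch below is an illustration only: the variable names come from the commands above, while `missing_env` is a hypothetical helper, not a docs2db function.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the commands above
WATSONX_VARS = ("WATSONX_API_KEY", "WATSONX_PROJECT_ID")


def missing_env(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


missing = missing_env(WATSONX_VARS)
if missing:
    print("Set these before running `docs2db chunk`: " + ", ".join(missing))
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.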

--------------------------------

### Run Docs2DB Pipeline for Quickstart

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Executes the complete docs2db pipeline to process documents and create a RAG database. This command ingests documents, generates chunks and embeddings, and loads them into PostgreSQL, producing a `ragdb_dump.sql` file.

```bash
docs2db pipeline /path/to/your/documents
```

--------------------------------

### DatabaseManager Class for Advanced Database Operations

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Illustrates the use of the `DatabaseManager` class for low-level database operations. It covers getting database configuration, initializing the schema, retrieving database statistics, getting and updating RAG settings, and generating a manifest file.

```python
import asyncio
from docs2db.database import DatabaseManager, get_db_config

async def main():
    # Get database configuration from environment/compose file
    config = get_db_config()

    # Create database manager
    db_manager = DatabaseManager(
        host=config["host"],
        port=int(config["port"]),
        database=config["database"],
        user=config["user"],
        password=config["password"]
    )

    # Initialize schema (creates tables if needed)
    await db_manager.initialize_schema()

    # Get database statistics
    stats = await db_manager.get_stats()
    print(f"Documents: {stats['documents']}")
    print(f"Chunks: {stats['chunks']}")
    print(f"Embedding models: {stats['embedding_models']}")

    # Get RAG settings
    settings = await db_manager.get_rag_settings()
    if settings:
        print(f"Max chunks: {settings['max_chunks']}")

    # Update RAG settings
    await db_manager.update_rag_settings(
        enable_refinement=True,
        enable_reranking=True,
        similarity_threshold=0.7,
        max_chunks=10
    )

    # Generate manifest of all source files
    await db_manager.generate_manifest("manifest.txt")

asyncio.run(main())
```

--------------------------------

### Commit Message Example (Git)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Two example commit messages, each with a clear summary line and bullet points detailing the changes. The final line shows how to credit AI assistance using the `Co-authored-by` trailer.

```text
Add contextual chunking support

- Implement LLM-based context generation
- Add OpenAI and WatsonX providers
- Include map-reduce for large documents

Refactor database connection logic

- Simplify connection pooling
- Add retry logic for transient failures

Co-authored-by: Claude 4.5 Sonnet
```

--------------------------------

### Database Lifecycle Functions

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Functions for managing the PostgreSQL database container using Podman or Docker, including starting, stopping, destroying the container, and retrieving logs.

````APIDOC
## Database Lifecycle Functions

### Description
Functions for managing the PostgreSQL database container using Podman or Docker, including starting, stopping, destroying the container, and retrieving logs.

### Functions
- **`start_database()`**: Starts the PostgreSQL database container. Creates `postgres-compose.yml` if it doesn't exist.
- **`stop_database()`**: Stops the PostgreSQL database container.
- **`destroy_database()`**: Stops and removes the PostgreSQL database container and its associated volumes.
- **`get_database_logs()`**: Retrieves the logs from the PostgreSQL database container.
- **`detect_container_runtime()`**: Detects the available container runtime (Podman or Docker).

### Parameters
None for these functions.

### Request Example
```python
from docs2db.db_lifecycle import (
    start_database,
    stop_database,
    destroy_database,
    get_database_logs,
    detect_container_runtime
)

# Detect available container runtime
runtime = detect_container_runtime()  # Returns "podman", "docker", or None

# Start PostgreSQL database
# Creates postgres-compose.yml if it doesn't exist
success = start_database()

# Stop PostgreSQL database
# success = stop_database()

# Destroy PostgreSQL database
# success = destroy_database()

# Get database logs
# logs = get_database_logs()
# print(logs)
```

### Response
#### Success Response
- **`detect_container_runtime()`**: Returns a string indicating the detected runtime ('podman', 'docker') or None.
- **`start_database()`**, **`stop_database()`**, **`destroy_database()`**: Return a boolean indicating success or failure.
- **`get_database_logs()`**: Returns a string containing the database logs.
````

--------------------------------

### Install Pre-commit Hooks for Docs2DB

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Installs pre-commit hooks to automate code quality checks and formatting before each commit. This ensures code consistency and adherence to project standards.

```bash
uv run pre-commit install
```

--------------------------------

### Run Docs2DB Full Pipeline

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Executes the complete Docs2DB pipeline, including database setup, ingestion, chunking, embedding, loading, and dumping. This is a quick way to test the entire workflow.

```bash
uv run docs2db pipeline tests/fixtures/input
```

--------------------------------

### Load Documents into PostgreSQL Asynchronously

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates the asynchronous loading of documents, chunks, and embeddings into a PostgreSQL database using the `load_documents` function. Examples include auto-detected and explicit database settings, loading specific content with metadata, and forcing a reload.

```python
import asyncio
from docs2db.database import load_documents

async def main():
    # Load with auto-detected database settings
    success = await load_documents()

    # Load with explicit database connection
    success = await load_documents(
        host="localhost",
        port=5432,
        db="ragdb",
        user="postgres",
        password="postgres"
    )

    # Load specific content with metadata
    success = await load_documents(
        content_dir="docs2db_content",
        model="ibm-granite/granite-embedding-30m-english",
        pattern="api_docs/**",
        title="API Documentation Database",
        description="Complete API reference for v2.0",
        username="build-bot",
        note="Automated nightly build"
    )

    # Force reload existing documents
    success = await load_documents(force=True, batch_size=50)

    return success

# Run async function
success = asyncio.run(main())
```

--------------------------------

### Update CHANGELOG (Markdown)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

An example of updating the `CHANGELOG.md` file to document changes for a new release. It follows a structured format with sections for Added, Changed, and Fixed items.

```markdown
## [0.2.0] - 2024-11-15

### Added
- New `pipeline` command for end-to-end workflow
- Database lifecycle commands (`db-start`, `db-stop`, etc.)

### Changed
- Improved PostgreSQL configuration with multi-tier precedence

### Fixed
- Database connection error messages now suggest correct CLI commands
```

--------------------------------

### Example Full Metadata Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/METADATA.md

This JSON object represents a comprehensive metadata file, including details about the filesystem, content, source, and processing stages. It serves as a template for understanding the structure and potential fields within a .meta.json file.

```json
{
  "metadata_version": "1.0",

  "filesystem": {
    "original_path": "/sources/docs/guide.html",
    "size_bytes": 245680,
    "mtime": "2025-10-23T10:30:00Z",
    "detected_mime": "text/html"
  },

  "content": {
    "title": "ExampleTech 9.4 Administration Guide",
    "language": "en"
  },

  "source": {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/en/documentation/exampletech/9.4/html/system_administrators_guide/index",
    "source_etag": "abc123def456",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-documentation-v1.0",
    "license": "CC-BY-SA-4.0"
  },

  "processing": {
    "source_hash": "xxh64:a1b2c3d4e5f6...",
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.42.1"
  }
}
```

--------------------------------

### Example Minimal Metadata Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/METADATA.md

This JSON object demonstrates a minimal metadata file, containing only essential auto-detected information like file size and processing timestamps. It highlights the sparse nature of metadata files, where omitted sections indicate no data was available or relevant.

```json
{
  "metadata_version": "1.0",
  "filesystem": {
    "size_bytes": 12540
  },
  "processing": {
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.42.1"
  }
}
```
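Since sections and keys may be omitted, code that reads a metadata file should not assume any of them exist. A minimal sketch of tolerant lookups, using the minimal example above embedded as a string; `get_field` is a hypothetical helper, not part of docs2db:

```python
import json

# The minimal example from above, embedded as a string for illustration
MINIMAL_META = """
{
  "metadata_version": "1.0",
  "filesystem": {"size_bytes": 12540},
  "processing": {
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.42.1"
  }
}
"""


def get_field(meta: dict, section: str, key: str, default=None):
    """Look up meta[section][key], tolerating omitted sections and keys."""
    return meta.get(section, {}).get(key, default)


meta = json.loads(MINIMAL_META)
print(get_field(meta, "filesystem", "size_bytes"))        # key present
print(get_field(meta, "source", "license"))               # whole section omitted
print(get_field(meta, "content", "title", "(untitled)"))  # fallback value
```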

--------------------------------

### Embedding Class for Programmatic Embedding Creation

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Provides examples of using the `Embedding` class for low-level control over embedding generation. It demonstrates creating an instance from a model name, ensuring model availability, generating embeddings for a file, and forcing regeneration.

```python
from docs2db.embeddings import Embedding
from pathlib import Path

# Create embedding instance from model name
embedding = Embedding.from_name("ibm-granite/granite-embedding-30m-english")

# Ensure model is downloaded locally
embedding.ensure_available()

# Generate embeddings for a chunks file
chunks_file = Path("docs2db_content/my_docs/document/chunks.json")
embeddings_file = embedding.generate_embedding(chunks_file)

if embeddings_file:
    print(f"Embeddings saved to: {embeddings_file}")
    # Creates: docs2db_content/my_docs/document/gran.json

# Force regeneration
embeddings_file = embedding.generate_embedding(chunks_file, force=True)

# Available models:
# - ibm-granite/granite-embedding-30m-english (384 dims, default)
# - ibm/slate-125m-english-rtrvr-v2 (768 dims)
# - intfloat/e5-small-v2 (384 dims)
# - avsolatorio/NoInstruct-small-Embedding-v0 (384 dims)
```

--------------------------------

### Query RAG Database with Python Library

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Uses `UniversalRAGEngine` from the `docs2db_api` library to run a hybrid search against a RAG database. The engine is created with a `RAGConfig`, started (which auto-detects the database and embedding model), and then queried.

```python
import asyncio
from docs2db_api.rag.engine import UniversalRAGEngine, RAGConfig

async def main():
    config = RAGConfig(similarity_threshold=0.7, max_chunks=5)
    engine = UniversalRAGEngine(config=config)
    await engine.start()  # Auto-detects database and embedding model
    return await engine.search_documents("How do I configure authentication?")

# Top-level `await` is not valid in a plain script, so drive the coroutine with asyncio.run()
results = asyncio.run(main())
```
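To illustrate what `similarity_threshold` and `max_chunks` plausibly control, the self-contained sketch below filters scored results by a minimum score and caps how many are returned. This is an assumption about the meaning of the two settings, not the engine's actual implementation; `select_chunks` and the sample data are hypothetical.

```python
from typing import List, Tuple


def select_chunks(
    scored: List[Tuple[str, float]],
    similarity_threshold: float = 0.7,
    max_chunks: int = 5,
) -> List[Tuple[str, float]]:
    """Keep results scoring at or above the threshold, best first, capped at max_chunks."""
    kept = [item for item in scored if item[1] >= similarity_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_chunks]


# Hypothetical (chunk id, similarity score) pairs
sample = [("chunk-a", 0.91), ("chunk-b", 0.42), ("chunk-c", 0.78)]
print(select_chunks(sample, similarity_threshold=0.7, max_chunks=5))
```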

--------------------------------

### Settings Configuration: Environment Variables and Programmatic Access

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates how to configure settings using environment variables or a .env file, and how to access these settings programmatically using the `settings` object from `docs2db.config`. Covers database and LLM configurations.

```dotenv
# .env file example
CONTENT_BASE_DIR=docs2db_content
LLM_SKIP_CONTEXT=false
LLM_PROVIDER=openai
LLM_CONTEXT_MODEL=qwen2.5:7b-instruct
LLM_OPENAI_URL=http://localhost:11434
LLM_WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_API_KEY=your-api-key
WATSONX_PROJECT_ID=your-project-id
EMBEDDING_MODEL=ibm-granite/granite-embedding-30m-english
```

```python
from docs2db.config import settings

print(f"Content dir: {settings.content_base_dir}")
print(f"Embedding model: {settings.embedding_model}")
print(f"LLM provider: {settings.llm_provider}")
print(f"Context model: {settings.llm_context_model}")
```
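For readers unfamiliar with the `.env` format, the sketch below parses simple `KEY=VALUE` lines into a dictionary using only the standard library. It illustrates the file format only; how `docs2db.config` actually loads the file is not shown in the source, and `parse_dotenv` is a hypothetical helper.

```python
from typing import Dict


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so URLs in values stay intact
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


sample = """
# .env file example
LLM_PROVIDER=openai
LLM_CONTEXT_MODEL=qwen2.5:7b-instruct
LLM_OPENAI_URL=http://localhost:11434
"""
print(parse_dotenv(sample))
```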

--------------------------------

### Database Lifecycle Management Functions

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Shows how to manage the PostgreSQL container lifecycle using functions like `start_database`, `stop_database`, `destroy_database`, and `get_database_logs`. It also includes detecting the container runtime.

```python
from docs2db.db_lifecycle import (
    start_database,
    stop_database,
    destroy_database,
    get_database_logs,
    detect_container_runtime
)

# Detect available container runtime
runtime = detect_container_runtime()  # Returns "podman", "docker", or None

# Start PostgreSQL database
# Creates postgres-compose.yml if it doesn't exist
success = start_database()
```
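As a rough idea of how runtime detection of this kind can work, the sketch below looks for the `podman` and `docker` executables on `PATH` with `shutil.which`. It is not the library's implementation: the function name `find_container_runtime` is hypothetical, and checking Podman before Docker is an assumption.

```python
import shutil
from typing import Callable, Optional


def find_container_runtime(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return "podman" or "docker", whichever is found first on PATH, else None."""
    for candidate in ("podman", "docker"):
        if which(candidate):
            return candidate
    return None


print(find_container_runtime())
```

The `which` parameter exists so the lookup can be replaced with a stub in tests.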

--------------------------------

### Generate Chunks with WatsonX or OpenAI-compatible API

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates how to use the `generate_chunks` function with different providers like WatsonX or an OpenAI-compatible API. It also shows how to override the context model limit for large documents.

```python
from docs2db.chunk import generate_chunks

# Use WatsonX for context generation
success = generate_chunks(
    provider="watsonx",
    watsonx_url="https://us-south.ml.cloud.ibm.com",
    context_model="ibm/granite-3-8b-instruct"
)

# Use OpenAI-compatible API
success = generate_chunks(
    provider="openai",
    openai_url="https://api.openai.com",
    context_model="gpt-4o-mini"
)

# Override model context limit for large documents
success = generate_chunks(
    context_model="qwen2.5:7b-instruct",
    context_limit_override=16000  # tokens
)
```

--------------------------------

### Test Docs2DB RAG Demo Client

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Demonstrates how to use the RAG (Retrieval-Augmented Generation) demo client for Docs2DB. It includes commands for both interactive queries and single-query execution.

```bash
uv run python scripts/rag_demo_client.py --query "your test query"
uv run python scripts/rag_demo_client.py --interactive
```

--------------------------------

### Manage Docs2DB Test Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to manage the PostgreSQL database specifically for running tests. This includes starting, stopping, and destroying the test database container.

```bash
make db-up-test
make db-down-test
make db-destroy-test
```

--------------------------------

### Build and Publish Package (uv)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to build the Python package distribution files and publish it to PyPI using the `uv` tool. This requires a PyPI token for authentication.

```bash
# Build distribution files
uv build

# Publish to PyPI (requires PyPI token)
uv publish
```

--------------------------------

### Manage Docs2DB Development Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to manage the main PostgreSQL database used for development in Docs2DB. Includes starting, stopping, destroying, and checking the status of the database.

```bash
uv run docs2db db-start
uv run docs2db db-stop
uv run docs2db db-destroy
uv run docs2db db-status
```

--------------------------------

### Set Up and Run Docs2DB with OpenAI

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Configures docs2db to use OpenAI models. Export the OpenAI API key as an environment variable, then run the chunking command with the OpenAI API URL and a context model.

```bash
# Set environment variable
export OPENAI_API_KEY="sk-..."

# Run chunking with OpenAI
uv run docs2db chunk \
  --openai-url "https://api.openai.com" \
  --context-model "gpt-4o-mini"
```

--------------------------------

### Load Data into Docs2DB Database with Defaults

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Loads processed document data into the PostgreSQL database using default connection settings. This command assumes the database is already started.

```bash
docs2db load
```

--------------------------------

### Use Local Small Model for Production Speed

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Runs `docs2db chunk` with the small local model `qwen2.5:3b-instruct`, an option for production use when speed is the priority.

```bash
uv run docs2db chunk --context-model qwen2.5:3b-instruct
```

--------------------------------

### Docs2DB Metadata File Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

The `meta.json` file stores versioned metadata about the ingested document. It includes details about the filesystem, content, source, and processing information, such as hashes and timestamps.

```json
{
  "metadata_version": "1.0",

  "filesystem": {
    "original_path": "documentation/example/9/guide.json",
    "size_bytes": 12540
  },

  "content": {
    "title": "ExampleTech Installation Guide",
    "language": "en"
  },

  "source": {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/",
    "source_etag": "abc123",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-graphql-v1.0",
    "license": "CC-BY-SA-4.0"
  },

  "processing": {
    "source_hash": "xxh64:a1b2c3d4e5f6",
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.44.0"
  }
}

```
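The timestamp and hash fields above are plain strings, so consumers need to parse them. A minimal sketch with the standard library, using values copied from the example; reading the `xxh64:` prefix as the hash algorithm name is an assumption, and both helper names are hypothetical.

```python
from datetime import datetime
from typing import Tuple


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def split_hash(value: str) -> Tuple[str, str]:
    """Split 'algorithm:digest' into its two parts."""
    algorithm, _, digest = value.partition(":")
    return algorithm, digest


# Values copied from the example above
retrieved = parse_utc("2025-10-23T10:30:00Z")
ingested = parse_utc("2025-10-23T10:31:00Z")
print((ingested - retrieved).total_seconds())
print(split_hash("xxh64:a1b2c3d4e5f6"))
```

Replacing the trailing `Z` with `+00:00` keeps the parsing compatible with Python versions whose `fromisoformat` does not accept `Z`.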

--------------------------------

### Run Tests and Pre-commit Checks (Makefile)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to execute the project's test suite and pre-commit hooks. These are essential steps to ensure code quality and correctness before submitting changes.

```bash
make test
uv run pre-commit run --all-files
```

--------------------------------

### Docs2DB CLI - Manifest Command

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

The `manifest` command generates a list of all unique source files currently stored in the database.

````APIDOC
## CLI: docs2db manifest

### Description
Generates a manifest file listing all unique source files in the database.

### Method
CLI Command

### Endpoint
N/A

### Parameters
#### Query Parameters
- **output-file** (string) - Optional - Specifies a custom file path for the manifest output. Defaults to standard output.

### Request Example
```bash
# Generate manifest to standard output
docs2db manifest

# Generate manifest to a custom file
docs2db manifest --output-file sources.txt
```

### Response
#### Success Response (0)
A list of source file paths, one per line.

#### Response Example
```
docs2db_content/my_docs/document/source.json
docs2db_content/api_docs/getting_started/source.json
```
````

--------------------------------

### Ingest In-Memory Content with Docs2DB (Python)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

The `ingest_from_content` function processes in-memory content (string or bytes) and generates Docling JSON and metadata files. It requires the content, a path for storage, a stream name to infer format, and optional source metadata and encoding.

```python
from pathlib import Path
from docs2db.ingest import ingest_from_content

# Prepare your content
html_content = "<html><body><h1>My Document</h1></body></html>"

# Build source metadata (optional but recommended)
source_metadata = {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/",
    "source_etag": "abc123",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-graphql-v1.0",
    "license": "CC-BY-SA-4.0",
}

# Ingest the content
success = ingest_from_content(
    content=html_content,
    content_path=Path("content/documentation/exampletech/9/guide"),
    stream_name="guide.html",  # Extension tells docling this is HTML
    source_metadata=source_metadata,
)

if success:
    print("✅ Document ingested successfully!")
    # Files created:
    #   - content/documentation/exampletech/9/guide/source.json (Docling JSON)
    #   - content/documentation/exampletech/9/guide/meta.json (Metadata)

```

--------------------------------

### Generate Manifest File with docs2db CLI

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Generates a manifest file listing all unique source files currently stored in the database. This is useful for tracking and managing ingested documents. An optional output file can be specified.

```bash
# Generate manifest file
docs2db manifest

# Custom output file
docs2db manifest --output-file sources.txt
```
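A manifest with one path per line is easy to post-process. The sketch below groups paths by the directory directly under the content root, assuming the layout `docs2db_content/<collection>/<document>/source.json` seen in the response example for this command elsewhere in this document; `group_by_collection` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def group_by_collection(manifest_text: str) -> Dict[str, List[str]]:
    """Group manifest paths by the directory directly under the content root."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = PurePosixPath(line).parts
        # parts[0] is the content root, parts[1] the collection directory
        collection = parts[1] if len(parts) > 2 else ""
        groups[collection].append(line)
    return dict(groups)


sample = """docs2db_content/my_docs/document/source.json
docs2db_content/api_docs/getting_started/source.json
"""
print(group_by_collection(sample))
```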

--------------------------------

### Docs2DB Database Troubleshooting Commands

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Commands to manage and check the status of the docs2db database. Useful for resolving 'Database connection refused' errors.

```bash
docs2db db-start      # Start the database
docs2db db-status     # Check connection
```

--------------------------------

### Skip Context Generation for Development/Testing

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command skips context generation entirely, which is useful for development and testing purposes to speed up processing.

```bash
uv run docs2db chunk --skip-context
```

--------------------------------

### Run Docs2DB Pipeline with Custom Options

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Executes the docs2db pipeline with custom configurations, including specifying an output file, skipping contextual chunking for faster processing, and using a different embedding model.

```bash
docs2db pipeline <path> \
  --output-file my-rag.sql \
  --skip-context \
  --model intfloat/e5-small-v2
```

--------------------------------

### Check OpenAI API Key

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Prints the `OPENAI_API_KEY` environment variable to check that it is set; empty output means it is not. This is a troubleshooting step for authentication issues.

```bash
echo $OPENAI_API_KEY
```
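Echoing the variable prints the full secret to the terminal. Where that is undesirable, a check like the following sketch reports whether the key is set while revealing only its last four characters. `describe_secret` is a hypothetical helper, not part of docs2db.

```python
import os
from typing import Mapping


def describe_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Report whether `name` is set, showing only the last four characters."""
    value = env.get(name, "")
    if not value:
        return f"{name} is not set"
    return f"{name} is set (ends with ...{value[-4:]})"


print(describe_secret("OPENAI_API_KEY"))
```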

--------------------------------

### Docs2DB Chunking Options

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Customization options for the `docs2db chunk` command, including skipping context generation, specifying LLM providers (Ollama, OpenAI, WatsonX), and defining patterns or content directories.

```bash
# Fast (skip contextual generation)
docs2db chunk --skip-context

# Custom LLM provider
docs2db chunk --context-model qwen2.5:7b-instruct              # Ollama
docs2db chunk --openai-url https://api.openai.com \
  --context-model gpt-4o-mini
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com # WatsonX

# Patterns and directories
docs2db chunk --pattern "docs/**"
docs2db chunk --content-dir my-content
```

--------------------------------

### DatabaseManager Class

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Provides low-level database operations for advanced use cases, including schema initialization, statistics retrieval, RAG settings management, and manifest generation.

````APIDOC
## DatabaseManager Class

### Description
Provides low-level database operations for advanced use cases, including schema initialization, statistics retrieval, RAG settings management, and manifest generation.

### Method
`DatabaseManager` class

### Instance Methods
- **`initialize_schema()`**: Initializes the database schema, creating tables if they do not exist.
- **`get_stats()`**: Retrieves statistics about the database content (documents, chunks, embedding models).
- **`get_rag_settings()`**: Retrieves the current Retrieval-Augmented Generation (RAG) settings.
- **`update_rag_settings(...)`**: Updates the RAG settings with new values.
- **`generate_manifest(output_file: str)`**: Generates a manifest file listing all source files in the database.

### Parameters
#### `DatabaseManager` Constructor Parameters
- **host** (str) - Required - The database host address.
- **port** (int) - Required - The database port.
- **database** (str) - Required - The database name.
- **user** (str) - Required - The database username.
- **password** (str) - Required - The database password.

#### `update_rag_settings` Parameters
- **enable_refinement** (bool) - Optional - Enables or disables refinement.
- **enable_reranking** (bool) - Optional - Enables or disables reranking.
- **similarity_threshold** (float) - Optional - The similarity threshold for retrieval.
- **max_chunks** (int) - Optional - The maximum number of chunks to retrieve.

#### `generate_manifest` Parameters
- **output_file** (str) - Required - The path to the file where the manifest will be saved.

### Request Example
```python
import asyncio
from docs2db.database import DatabaseManager, get_db_config

async def main():
    # Get database configuration from environment/compose file
    config = get_db_config()

    # Create database manager
    db_manager = DatabaseManager(
        host=config["host"],
        port=int(config["port"]),
        database=config["database"],
        user=config["user"],
        password=config["password"]
    )

    # Initialize schema (creates tables if needed)
    await db_manager.initialize_schema()

    # Get database statistics
    stats = await db_manager.get_stats()
    print(f"Documents: {stats['documents']}")
    print(f"Chunks: {stats['chunks']}")
    print(f"Embedding models: {stats['embedding_models']}")

    # Get RAG settings
    settings = await db_manager.get_rag_settings()
    if settings:
        print(f"Max chunks: {settings['max_chunks']}")

    # Update RAG settings
    await db_manager.update_rag_settings(
        enable_refinement=True,
        enable_reranking=True,
        similarity_threshold=0.7,
        max_chunks=10
    )

    # Generate manifest of all source files
    await db_manager.generate_manifest("manifest.txt")

asyncio.run(main())
```

### Response
#### Success Response
- **stats** (dict) - Dictionary containing database statistics.
- **settings** (dict or None) - Dictionary containing RAG settings, or None if not configured.
- **`generate_manifest`** returns None upon successful completion.
````

--------------------------------

### Pull Ollama Model

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Pulls the `qwen2.5:3b-instruct` model with Ollama. Use it to resolve 'Model not found' errors when using Ollama.

```bash
ollama pull qwen2.5:3b-instruct
```
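
Before pulling, it can help to check whether the model is already present locally. The sketch below is not part of docs2db; it assumes a running Ollama server that exposes its `GET /api/tags` endpoint, and the helper names (`model_names`, `is_model_available`, `fetch_tags`) are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> set[str]:
    """Extract model names from the JSON body of an Ollama /api/tags response."""
    return {m["name"] for m in tags_payload.get("models", [])}


def is_model_available(tags_payload: dict, model: str) -> bool:
    """Return True if `model` is among the locally pulled models."""
    return model in model_names(tags_payload)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server for its local models.

    The default URL is an assumption; adjust it for your setup.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload in the /api/tags shape.
sample = {"models": [{"name": "qwen2.5:7b-instruct"}]}
wanted = "qwen2.5:3b-instruct"
if not is_model_available(sample, wanted):
    print(f"{wanted} is missing; run: ollama pull {wanted}")
```

Against a live server, replace `sample` with `fetch_tags()`.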

--------------------------------

### Database Operations: Stop, Logs, Destroy

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Functions for managing the database: stop it while preserving its data, view its logs (optionally in follow mode), or destroy it along with all associated data.

```python
# Stop the database (data is preserved)
success = stop_database()

# View logs
success = get_database_logs()
success = get_database_logs(follow=True)  # Follow mode

# Destroy database and all data
success = destroy_database()
```
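
Because destroying the database removes all data, a caller may want a confirmation step in front of it. The following is a minimal sketch, not docs2db functionality: `destroy_with_confirmation` is a hypothetical helper, and it assumes `destroy_database()` returns a truthy value on success, as the `success = ...` assignments above suggest. The destroy function is passed in as an argument, so the sketch makes no assumption about its import path.

```python
from typing import Callable


def destroy_with_confirmation(
    destroy_fn: Callable[[], bool],
    typed: str,
    expected: str = "destroy",
) -> bool:
    """Call destroy_fn only if the operator typed the expected phrase.

    destroy_fn stands in for destroy_database(). Returns False without
    calling it when the confirmation does not match.
    """
    if typed.strip().lower() != expected:
        return False  # Nothing was destroyed
    return bool(destroy_fn())
```

A call might look like `destroy_with_confirmation(destroy_database, input("Type 'destroy' to confirm: "))`.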

--------------------------------

### Docs2DB CLI Ingest and Processing Commands

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Core commands for ingesting and processing documents with docs2db. Each step generates intermediate files in the `docs2db_content/` directory.

```bash
docs2db ingest <path>                # Ingest documents
docs2db chunk                        # Generate chunks
docs2db embed                        # Generate embeddings
docs2db load                         # Load into database
docs2db db-dump                      # Create SQL dump
docs2db db-restore <file>            # Restore from dump
docs2db audit                        # Check content directory
```
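
The first four commands form a pipeline in which each step consumes the previous step's output. A script can run them in order and stop at the first failure. This is a sketch under two assumptions: `docs2db` is on `PATH`, and it exits with a non-zero code on failure. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def pipeline_commands(source_path: str) -> list[list[str]]:
    """Return the four build steps in the order listed above."""
    return [
        ["docs2db", "ingest", source_path],
        ["docs2db", "chunk"],
        ["docs2db", "embed"],
        ["docs2db", "load"],
    ]


def run_pipeline(
    source_path: str,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each step in turn; return the first non-zero exit code, else 0."""
    for cmd in pipeline_commands(source_path):
        code = runner(cmd)
        if code != 0:
            # Assumes a non-zero exit code signals failure.
            print(f"step failed ({code}): {' '.join(cmd)}")
            return code
    return 0
```

Passing `runner` as a parameter allows a dry run, for example `run_pipeline("docs/", runner=lambda cmd: print(cmd) or 0)`.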

--------------------------------

### Python Library: ingest_from_content

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Converts in-memory content (HTML, markdown, etc.) directly to Docling JSON without requiring intermediate files.

```APIDOC
## Python Library: ingest_from_content

### Description
Ingests content directly from memory (e.g., strings) into Docling JSON format, useful for dynamic or API-fetched content.

### Method
Python Function

### Endpoint
N/A

### Parameters
#### Arguments
- **content** (str or bytes) - Required - The content to ingest.
- **content_path** (Path) - Required - The directory where the Docling JSON and metadata will be stored.
- **stream_name** (str) - Required - The name of the stream, including the file extension, which helps in format detection (e.g., `"document.html"`, `"report.md"`).
- **source_metadata** (dict) - Optional - Metadata about the source of the content.
- **content_encoding** (str) - Optional - The encoding of the content if it's provided as bytes (e.g., `"utf-16"`).

### Request Example
```python
from pathlib import Path
from docs2db.ingest import ingest_from_content

# Ingest HTML content from memory
html_content = """
<html>
<head><title>API Documentation</title></head>
<body>
<h1>Getting Started</h1>
<p>Welcome to our API documentation...</p>
</body>
</html>
"""

success = ingest_from_content(
    content=html_content,
    content_path=Path("docs2db_content/api_docs/getting_started"),
    stream_name="getting_started.html",  # Extension determines format detection
    source_metadata={
        "source_url": "https://api.example.com/docs/getting-started",
        "retrieved_at": "2024-01-15T10:30:00Z"
    }
)

# Ingest markdown content
md_content = "# User Guide\n\nThis guide covers..."
success = ingest_from_content(
    content=md_content,
    content_path=Path("docs2db_content/guides/user_guide"),
    stream_name="user_guide.md"
)

# Ingest with custom encoding
success = ingest_from_content(
    content=html_content.encode("utf-16"),
    content_path=Path("docs2db_content/legacy/doc"),
    stream_name="doc.html",
    content_encoding="utf-16"
)
```

### Response
#### Success Response (boolean)
- **True** if ingestion was successful.
- **False** if ingestion failed.

#### Response Example
```
True
```
```
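
When ingesting many fetched pages, `content_path` and `stream_name` can be derived from each page's URL. The sketch below shows one possible approach. The `root/<host>/<stem>` layout, the `.html` fallback, and the helper names are assumptions made for illustration; the only documented requirement is that `stream_name` carries a file extension. The ingest function is passed in as an argument and is called with the keyword arguments documented above.

```python
from pathlib import Path, PurePosixPath
from typing import Callable
from urllib.parse import urlparse


def ingest_target(url: str, root: Path = Path("docs2db_content")) -> tuple[Path, str]:
    """Map a URL to a (content_path, stream_name) pair.

    Hypothetical convention: root/<host>/<file stem>.
    """
    parsed = urlparse(url)
    name = PurePosixPath(parsed.path).name or "index.html"
    if not PurePosixPath(name).suffix:
        name += ".html"  # Assumption: extensionless pages are HTML
    return root / parsed.netloc / PurePosixPath(name).stem, name


def ingest_pages(pages: dict[str, str], ingest: Callable[..., bool]) -> dict[str, bool]:
    """Ingest each {url: body} entry; `ingest` is e.g. ingest_from_content."""
    results = {}
    for url, body in pages.items():
        content_path, stream_name = ingest_target(url)
        results[url] = ingest(
            content=body,
            content_path=content_path,
            stream_name=stream_name,
            source_metadata={"source_url": url},
        )
    return results
```

With docs2db installed, usage would be `ingest_pages(pages, ingest_from_content)`; the returned dict shows which URLs failed.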

--------------------------------

### Use Cloud Model for Production Quality

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

Runs `docs2db chunk` with a capable cloud model, `gpt-4o-mini`, for production use where quality is the priority. The `--llm-base-url` option sets the base URL of the LLM API.

```bash
uv run docs2db chunk \
  --llm-base-url "https://api.openai.com" \
  --context-model "gpt-4o-mini"
```

--------------------------------

### Configure Docs2DB Chunking with Faster Local Ollama Models

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

These commands run `docs2db chunk` with smaller, faster local models served by Ollama: a 3B or 1.5B Qwen 2.5 model, or alternatives such as Llama 3.2 or Gemma 2. The last command points docs2db at a custom Ollama URL.

```bash
# 3B model (2-3x faster)
uv run docs2db chunk --context-model qwen2.5:3b-instruct

# 1.5B model (4-5x faster, may be lower quality)
uv run docs2db chunk --context-model qwen2.5:1.5b-instruct

# Alternative fast models
uv run docs2db chunk --context-model llama3.2:3b-instruct
uv run docs2db chunk --context-model gemma2:2b-instruct

# Custom Ollama URL
uv run docs2db chunk --openai-url "http://localhost:11434" --context-model qwen2.5:7b-instruct
```