### Docs2DB Development Setup and Testing

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Commands for setting up the docs2db development environment, including cloning the repository, installing dependencies, and running tests and pre-commit hooks.

```bash
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install

# Run tests
make test

# Run all checks
pre-commit run --all-files
```

--------------------------------

### Clone and Install Dependencies for Docs2DB

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Clones the Docs2DB repository and installs project dependencies using uv. This is the initial setup step for developers.

```bash
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
```

--------------------------------

### Install Optional WatsonX Dependencies (Bash)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

These commands show how to install optional dependencies for WatsonX support. The first command is for local development within the repository, synchronizing dependencies with the 'watsonx' extra. The second command is for general tool installation, specifying the 'docs2db[watsonx]' package to include WatsonX capabilities.

```bash
# For development (local repo):
uv sync --extra watsonx

# For tool installation:
uv tool install 'docs2db[watsonx]'
```

--------------------------------

### Install Docs2DB using uv

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Installs the docs2db tool using the `uv` package manager. This command is used to set up the necessary tools for building RAG databases.

```bash
uv tool install docs2db
```

--------------------------------

### Add Docs2DB as a Project Dependency (TOML)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

Demonstrates different methods for adding Docs2DB as a dependency in a Python project using `pyproject.toml`.
Options include local editable installs, direct Git repository links, and standard PyPI package installation.

```toml
[project]
dependencies = [
    "docs2db",
    # ... other dependencies
]

[tool.uv.sources]
docs2db = { path = "../docs2db", editable = true }
```

```toml
[project]
dependencies = [
    "docs2db @ git+https://github.com/rhel-lightspeed/docs2db.git",
]
```

```toml
[project]
dependencies = [
    "docs2db>=0.1.0",
]
```

--------------------------------

### Test Package Installation Locally (uv)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

A sequence of commands to test the installation of the built package in an isolated virtual environment. This helps verify that the package builds correctly and installs without issues.

```bash
# Build the wheel
cd /path/to/docs2db
uv build

# Test installation in isolated environment
cd /tmp
mkdir test-install && cd test-install
uv venv
uv pip install /path/to/docs2db/dist/docs2db-*.whl

# Verify CLI works
uv run docs2db --help
uv run docs2db pipeline --help
```

--------------------------------

### Start Docs2DB PostgreSQL Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Starts a PostgreSQL database server using Podman or Docker, which is required for storing the RAG database. This command is part of the database lifecycle management.

```bash
docs2db db-start
```

--------------------------------

### Set Up and Run Docs2DB with IBM WatsonX

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This section details the setup and execution of docs2db using IBM WatsonX. It requires setting API key and project ID as environment variables, then running the chunking command with the WatsonX URL and a specified context model.
```bash
# Set environment variables
export WATSONX_API_KEY="your-api-key-here"
export WATSONX_PROJECT_ID="your-project-id-here"

# Run chunking with WatsonX
uv run docs2db chunk \
  --watsonx-url "https://us-south.ml.cloud.ibm.com" \
  --context-model "ibm/granite-13b-chat-v2"
```

--------------------------------

### Run Docs2DB Pipeline for Quickstart

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Executes the complete docs2db pipeline to process documents and create a RAG database. This command ingests documents, generates chunks and embeddings, and loads them into PostgreSQL, producing a `ragdb_dump.sql` file.

```bash
docs2db pipeline /path/to/your/documents
```

--------------------------------

### DatabaseManager Class for Advanced Database Operations

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Illustrates the use of the `DatabaseManager` class for low-level database operations. It covers getting database configuration, initializing the schema, retrieving database statistics, getting and updating RAG settings, and generating a manifest file.
```python
import asyncio
from docs2db.database import DatabaseManager, get_db_config

async def main():
    # Get database configuration from environment/compose file
    config = get_db_config()

    # Create database manager
    db_manager = DatabaseManager(
        host=config["host"],
        port=int(config["port"]),
        database=config["database"],
        user=config["user"],
        password=config["password"]
    )

    # Initialize schema (creates tables if needed)
    await db_manager.initialize_schema()

    # Get database statistics
    stats = await db_manager.get_stats()
    print(f"Documents: {stats['documents']}")
    print(f"Chunks: {stats['chunks']}")
    print(f"Embedding models: {stats['embedding_models']}")

    # Get RAG settings
    settings = await db_manager.get_rag_settings()
    if settings:
        print(f"Max chunks: {settings['max_chunks']}")

    # Update RAG settings
    await db_manager.update_rag_settings(
        enable_refinement=True,
        enable_reranking=True,
        similarity_threshold=0.7,
        max_chunks=10
    )

    # Generate manifest of all source files
    await db_manager.generate_manifest("manifest.txt")

asyncio.run(main())
```

--------------------------------

### Commit Message Example (Git)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

An example of a clear and descriptive Git commit message, including a summary line and bullet points detailing the changes. It also shows how to credit AI assistance using the 'Co-authored-by' tag.

```bash
Add contextual chunking support

- Implement LLM-based context generation
- Add OpenAI and WatsonX providers
- Include map-reduce for large documents

Refactor database connection logic

- Simplify connection pooling
- Add retry logic for transient failures

Co-authored-by: Claude 4.5 Sonnet
```

--------------------------------

### Database Lifecycle Functions

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Functions for managing the PostgreSQL database container using Podman or Docker, including starting, stopping, destroying the container, and retrieving logs.
```APIDOC
## Database Lifecycle Functions

### Description
Functions for managing the PostgreSQL database container using Podman or Docker, including starting, stopping, destroying the container, and retrieving logs.

### Functions
- **`start_database()`**: Starts the PostgreSQL database container. Creates `postgres-compose.yml` if it doesn't exist.
- **`stop_database()`**: Stops the PostgreSQL database container.
- **`destroy_database()`**: Stops and removes the PostgreSQL database container and its associated volumes.
- **`get_database_logs()`**: Retrieves the logs from the PostgreSQL database container.
- **`detect_container_runtime()`**: Detects the available container runtime (Podman or Docker).

### Parameters
None for these functions.

### Request Example
```python
from docs2db.db_lifecycle import (
    start_database,
    stop_database,
    destroy_database,
    get_database_logs,
    detect_container_runtime
)

# Detect available container runtime
runtime = detect_container_runtime()  # Returns "podman", "docker", or None

# Start PostgreSQL database
# Creates postgres-compose.yml if it doesn't exist
success = start_database()

# Stop PostgreSQL database
# success = stop_database()

# Destroy PostgreSQL database
# success = destroy_database()

# Get database logs
# logs = get_database_logs()
# print(logs)
```

### Response
#### Success Response
- **`detect_container_runtime()`**: Returns a string indicating the detected runtime ('podman', 'docker') or None.
- **`start_database()`**, **`stop_database()`**, **`destroy_database()`**: Return a boolean indicating success or failure.
- **`get_database_logs()`**: Returns a string containing the database logs.
```

--------------------------------

### Install Pre-commit Hooks for Docs2DB

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Installs pre-commit hooks to automate code quality checks and formatting before each commit. This ensures code consistency and adherence to project standards.
```bash
uv run pre-commit install
```

--------------------------------

### Run Docs2DB Full Pipeline

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Executes the complete Docs2DB pipeline, including database setup, ingestion, chunking, embedding, loading, and dumping. This is a quick way to test the entire workflow.

```bash
uv run docs2db pipeline tests/fixtures/input
```

--------------------------------

### Load Documents into PostgreSQL Asynchronously

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates the asynchronous loading of documents, chunks, and embeddings into a PostgreSQL database using the `load_documents` function. Examples include auto-detected and explicit database settings, loading specific content with metadata, and forcing a reload.

```python
import asyncio
from docs2db.database import load_documents

async def main():
    # Load with auto-detected database settings
    success = await load_documents()

    # Load with explicit database connection
    success = await load_documents(
        host="localhost",
        port=5432,
        db="ragdb",
        user="postgres",
        password="postgres"
    )

    # Load specific content with metadata
    success = await load_documents(
        content_dir="docs2db_content",
        model="ibm-granite/granite-embedding-30m-english",
        pattern="api_docs/**",
        title="API Documentation Database",
        description="Complete API reference for v2.0",
        username="build-bot",
        note="Automated nightly build"
    )

    # Force reload existing documents
    success = await load_documents(force=True, batch_size=50)

    return success

# Run async function
success = asyncio.run(main())
```

--------------------------------

### Update CHANGELOG (Markdown)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

An example of updating the `CHANGELOG.md` file to document changes for a new release. It follows a structured format with sections for Added, Changed, and Fixed items.
```markdown
## [0.2.0] - 2024-11-15

### Added
- New `pipeline` command for end-to-end workflow
- Database lifecycle commands (`db-start`, `db-stop`, etc.)

### Changed
- Improved PostgreSQL configuration with multi-tier precedence

### Fixed
- Database connection error messages now suggest correct CLI commands
```

--------------------------------

### Example Full Metadata Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/METADATA.md

This JSON object represents a comprehensive metadata file, including details about the filesystem, content, source, and processing stages. It serves as a template for understanding the structure and potential fields within a .meta.json file.

```json
{
  "metadata_version": "1.0",
  "filesystem": {
    "original_path": "/sources/docs/guide.html",
    "size_bytes": 245680,
    "mtime": "2025-10-23T10:30:00Z",
    "detected_mime": "text/html"
  },
  "content": {
    "title": "ExampleTech 9.4 Administration Guide",
    "language": "en"
  },
  "source": {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/en/documentation/exampletech/9.4/html/system_administrators_guide/index",
    "source_etag": "abc123def456",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-documentation-v1.0",
    "license": "CC-BY-SA-4.0"
  },
  "processing": {
    "source_hash": "xxh64:a1b2c3d4e5f6...",
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.42.1"
  }
}
```

--------------------------------

### Example Minimal Metadata Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/METADATA.md

This JSON object demonstrates a minimal metadata file, containing only essential auto-detected information like file size and processing timestamps. It highlights the sparse nature of metadata files, where omitted sections indicate no data was available or relevant.
```json
{
  "metadata_version": "1.0",
  "filesystem": {
    "size_bytes": 12540
  },
  "processing": {
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.42.1"
  }
}
```

--------------------------------

### Embedding Class for Programmatic Embedding Creation

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Provides examples of using the `Embedding` class for low-level control over embedding generation. It demonstrates creating an instance from a model name, ensuring model availability, generating embeddings for a file, and forcing regeneration.

```python
from docs2db.embeddings import Embedding
from pathlib import Path

# Create embedding instance from model name
embedding = Embedding.from_name("ibm-granite/granite-embedding-30m-english")

# Ensure model is downloaded locally
embedding.ensure_available()

# Generate embeddings for a chunks file
chunks_file = Path("docs2db_content/my_docs/document/chunks.json")
embeddings_file = embedding.generate_embedding(chunks_file)

if embeddings_file:
    print(f"Embeddings saved to: {embeddings_file}")
    # Creates: docs2db_content/my_docs/document/gran.json

# Force regeneration
embeddings_file = embedding.generate_embedding(chunks_file, force=True)

# Available models:
# - ibm-granite/granite-embedding-30m-english (384 dims, default)
# - ibm/slate-125m-english-rtrvr-v2 (768 dims)
# - intfloat/e5-small-v2 (384 dims)
# - avsolatorio/NoInstruct-small-Embedding-v0 (384 dims)
```

--------------------------------

### Query RAG Database with Python Library

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

This Python code snippet demonstrates how to use the UniversalRAGEngine from the docs2db_api library to perform hybrid search on a RAG database. It initializes the engine with a configuration, starts it to auto-detect the database and embedding model, and then searches for documents based on a query.
```python
import asyncio

from docs2db_api.rag.engine import UniversalRAGEngine, RAGConfig

async def main():
    config = RAGConfig(similarity_threshold=0.7, max_chunks=5)
    engine = UniversalRAGEngine(config=config)
    await engine.start()  # Auto-detects database and embedding model

    results = await engine.search_documents("How do I configure authentication?")
    return results

# `await` is only valid inside a coroutine, so run the query through asyncio
results = asyncio.run(main())
```

--------------------------------

### Settings Configuration: Environment Variables and Programmatic Access

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates how to configure settings using environment variables or a .env file, and how to access these settings programmatically using the `settings` object from `docs2db.config`. Covers database and LLM configurations.

```dotenv
# .env file example
CONTENT_BASE_DIR=docs2db_content
LLM_SKIP_CONTEXT=false
LLM_PROVIDER=openai
LLM_CONTEXT_MODEL=qwen2.5:7b-instruct
LLM_OPENAI_URL=http://localhost:11434
LLM_WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_API_KEY=your-api-key
WATSONX_PROJECT_ID=your-project-id
EMBEDDING_MODEL=ibm-granite/granite-embedding-30m-english
```

```python
from docs2db.config import settings

print(f"Content dir: {settings.content_base_dir}")
print(f"Embedding model: {settings.embedding_model}")
print(f"LLM provider: {settings.llm_provider}")
print(f"Context model: {settings.llm_context_model}")
```

--------------------------------

### Database Lifecycle Management Functions

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Shows how to manage the PostgreSQL container lifecycle using functions like `start_database`, `stop_database`, `destroy_database`, and `get_database_logs`. It also includes detecting the container runtime.
```python
from docs2db.db_lifecycle import (
    start_database,
    stop_database,
    destroy_database,
    get_database_logs,
    detect_container_runtime
)

# Detect available container runtime
runtime = detect_container_runtime()  # Returns "podman", "docker", or None

# Start PostgreSQL database
# Creates postgres-compose.yml if it doesn't exist
success = start_database()
```

--------------------------------

### Generate Chunks with WatsonX or OpenAI-compatible API

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Demonstrates how to use the `generate_chunks` function with different providers like WatsonX or an OpenAI-compatible API. It also shows how to override the context model limit for large documents.

```python
from docs2db.chunk import generate_chunks

# Use WatsonX for context generation
success = generate_chunks(
    provider="watsonx",
    watsonx_url="https://us-south.ml.cloud.ibm.com",
    context_model="ibm/granite-3-8b-instruct"
)

# Use OpenAI-compatible API
success = generate_chunks(
    provider="openai",
    openai_url="https://api.openai.com",
    context_model="gpt-4o-mini"
)

# Override model context limit for large documents
success = generate_chunks(
    context_model="qwen2.5:7b-instruct",
    context_limit_override=16000  # tokens
)
```

--------------------------------

### Test Docs2DB RAG Demo Client

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Demonstrates how to use the RAG (Retrieval-Augmented Generation) demo client for Docs2DB. It includes commands for both interactive queries and single-query execution.

```bash
uv run python scripts/rag_demo_client.py --query "your test query"
uv run python scripts/rag_demo_client.py --interactive
```

--------------------------------

### Manage Docs2DB Test Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to manage the PostgreSQL database specifically for running tests. This includes starting, stopping, and destroying the test database container.
```bash
make db-up-test
make db-down-test
make db-destroy-test
```

--------------------------------

### Build and Publish Package (uv)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to build the Python package distribution files and publish it to PyPI using the `uv` tool. This requires a PyPI token for authentication.

```bash
# Build distribution files
uv build

# Publish to PyPI (requires PyPI token)
uv publish
```

--------------------------------

### Manage Docs2DB Development Database

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to manage the main PostgreSQL database used for development in Docs2DB. Includes starting, stopping, destroying, and checking the status of the database.

```bash
uv run docs2db db-start
uv run docs2db db-stop
uv run docs2db db-destroy
uv run docs2db db-status
```

--------------------------------

### Set Up and Run Docs2DB with OpenAI

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This section outlines the process for configuring docs2db to use OpenAI models. It involves setting the OpenAI API key as an environment variable and then executing the chunking command with the OpenAI API URL and a chosen context model.

```bash
# Set environment variable
export OPENAI_API_KEY="sk-..."

# Run chunking with OpenAI
uv run docs2db chunk \
  --openai-url "https://api.openai.com" \
  --context-model "gpt-4o-mini"
```

--------------------------------

### Load Data into Docs2DB Database with Defaults

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Loads processed document data into the PostgreSQL database using default connection settings. This command assumes the database is already started.
```bash
docs2db load
```

--------------------------------

### Use Local Small Model for Production Speed

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command configures docs2db chunk for production with a speed priority by using a local small model, specifically 'qwen2.5:3b-instruct'.

```bash
uv run docs2db chunk --context-model qwen2.5:3b-instruct
```

--------------------------------

### Docs2DB Metadata File Structure (JSON)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

The `meta.json` file stores versioned metadata about the ingested document. It includes details about the filesystem, content, source, and processing information, such as hashes and timestamps.

```json
{
  "metadata_version": "1.0",
  "filesystem": {
    "original_path": "documentation/example/9/guide.json",
    "size_bytes": 12540
  },
  "content": {
    "title": "ExampleTech Installation Guide",
    "language": "en"
  },
  "source": {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/",
    "source_etag": "abc123",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-graphql-v1.0",
    "license": "CC-BY-SA-4.0"
  },
  "processing": {
    "source_hash": "xxh64:a1b2c3d4e5f6",
    "ingested_at": "2025-10-23T10:31:00Z",
    "docling_version": "2.44.0"
  }
}
```

--------------------------------

### Run Tests and Pre-commit Checks (Makefile)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/CONTRIBUTING.md

Commands to execute the project's test suite and pre-commit hooks. These are essential steps to ensure code quality and correctness before submitting changes.

```bash
make test
uv run pre-commit run --all-files
```

--------------------------------

### Docs2DB CLI - Manifest Command

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

The `manifest` command generates a list of all unique source files currently stored in the database.
```APIDOC
## CLI: docs2db manifest

### Description
Generates a manifest file listing all unique source files in the database.

### Method
CLI Command

### Endpoint
N/A

### Parameters
#### Query Parameters
- **output-file** (string) - Optional - Specifies a custom file path for the manifest output. Defaults to standard output.

### Request Example
```bash
# Generate manifest to standard output
docs2db manifest

# Generate manifest to a custom file
docs2db manifest --output-file sources.txt
```

### Response
#### Success Response (0)
A list of source file paths, one per line.

#### Response Example
```
docs2db_content/my_docs/document/source.json
docs2db_content/api_docs/getting_started/source.json
```
```

--------------------------------

### Ingest In-Memory Content with Docs2DB (Python)

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/INTEGRATION.md

The `ingest_from_content` function processes in-memory content (string or bytes) and generates Docling JSON and metadata files. It requires the content, a path for storage, a stream name to infer format, and optional source metadata and encoding.

```python
from pathlib import Path
from docs2db.ingest import ingest_from_content

# Prepare your content
# (the HTML markup was lost in extraction; these tags are a minimal stand-in)
html_content = "<html><body><h1>My Document</h1></body></html>"

# Build source metadata (optional but recommended)
source_metadata = {
    "source_type": "graphql",
    "source_url": "https://docs.example.com/",
    "source_etag": "abc123",
    "retrieved_at": "2025-10-23T10:30:00Z",
    "retriever": "example-graphql-v1.0",
    "license": "CC-BY-SA-4.0",
}

# Ingest the content
success = ingest_from_content(
    content=html_content,
    content_path=Path("content/documentation/exampletech/9/guide"),
    stream_name="guide.html",  # Extension tells docling this is HTML
    source_metadata=source_metadata,
)

if success:
    print("✅ Document ingested successfully!")
    # Files created:
    # - content/documentation/exampletech/9/guide/source.json (Docling JSON)
    # - content/documentation/exampletech/9/guide/meta.json (Metadata)
```

--------------------------------

### Generate Manifest File with docs2db CLI

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Generates a manifest file listing all unique source files currently stored in the database. This is useful for tracking and managing ingested documents. An optional output file can be specified.

```bash
# Generate manifest file
docs2db manifest

# Custom output file
docs2db manifest --output-file sources.txt
```

--------------------------------

### Docs2DB Database Troubleshooting Commands

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Commands to manage and check the status of the docs2db database. Useful for resolving 'Database connection refused' errors.

```bash
docs2db db-start   # Start the database
docs2db db-status  # Check connection
```

--------------------------------

### Skip Context Generation for Development/Testing

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command skips context generation entirely, which is useful for development and testing purposes to speed up processing.
```bash
uv run docs2db chunk --skip-context
```

--------------------------------

### JavaScript: Google Tag Manager Initialization

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/tests/fixtures/input/web/pages/renewable-energy.html

This script initializes and configures Google Tag Manager (GTM) for consent management and loading. It sets default consent settings for various storage types to 'denied' and then asynchronously loads the GTM script. This is a standard practice for integrating GTM into a website.

```javascript
/* Prepare Google Tag Manager */
window.dataLayer = window.dataLayer || [];
function gtag(){ dataLayer.push(arguments); }
gtag("consent", "default", {
  "ad_storage": "denied",
  "ad_user_data": "denied",
  "ad_personalization": "denied",
  "analytics_storage": "denied",
  "wait_for_update": 1000
});

/* Load Google Tag Manager */
(function(w,d,s,l,i){
  w[l]=w[l]||[];
  w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});
  var f=d.getElementsByTagName(s)[0],
      j=d.createElement(s),
      dl=l!='dataLayer'?'&l='+l:'';
  j.async=true;
  j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;
  f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-N2D4V8S');
```

--------------------------------

### Run Docs2DB Pipeline with Custom Options

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Executes the docs2db pipeline with custom configurations, including specifying an output file, skipping contextual chunking for faster processing, and using a different embedding model.

```bash
docs2db pipeline \
  --output-file my-rag.sql \
  --skip-context \
  --model intfloat/e5-small-v2
```

--------------------------------

### Check OpenAI API Key

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command checks if the OpenAI API key environment variable is set. It's a troubleshooting step for authentication issues.
```bash
echo $OPENAI_API_KEY
```

--------------------------------

### Docs2DB Chunking Options

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Customization options for the `docs2db chunk` command, including skipping context generation, specifying LLM providers (Ollama, OpenAI, WatsonX), and defining patterns or content directories.

```bash
# Fast (skip contextual generation)
docs2db chunk --skip-context

# Custom LLM provider
docs2db chunk --context-model qwen2.5:7b-instruct  # Ollama
docs2db chunk --openai-url https://api.openai.com \
  --context-model gpt-4o-mini
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com  # WatsonX

# Patterns and directories
docs2db chunk --pattern "docs/**"
docs2db chunk --content-dir my-content
```

--------------------------------

### DatabaseManager Class

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Provides low-level database operations for advanced use cases, including schema initialization, statistics retrieval, RAG settings management, and manifest generation.

```APIDOC
## DatabaseManager Class

### Description
Provides low-level database operations for advanced use cases, including schema initialization, statistics retrieval, RAG settings management, and manifest generation.

### Method
`DatabaseManager` class

### Instance Methods
- **`initialize_schema()`**: Initializes the database schema, creating tables if they do not exist.
- **`get_stats()`**: Retrieves statistics about the database content (documents, chunks, embedding models).
- **`get_rag_settings()`**: Retrieves the current Retrieval-Augmented Generation (RAG) settings.
- **`update_rag_settings(...)`**: Updates the RAG settings with new values.
- **`generate_manifest(output_file: str)`**: Generates a manifest file listing all source files in the database.

### Parameters
#### `DatabaseManager` Constructor Parameters
- **host** (str) - Required - The database host address.
- **port** (int) - Required - The database port.
- **database** (str) - Required - The database name.
- **user** (str) - Required - The database username.
- **password** (str) - Required - The database password.

#### `update_rag_settings` Parameters
- **enable_refinement** (bool) - Optional - Enables or disables refinement.
- **enable_reranking** (bool) - Optional - Enables or disables reranking.
- **similarity_threshold** (float) - Optional - The similarity threshold for retrieval.
- **max_chunks** (int) - Optional - The maximum number of chunks to retrieve.

#### `generate_manifest` Parameters
- **output_file** (str) - Required - The path to the file where the manifest will be saved.

### Request Example
```python
import asyncio
from docs2db.database import DatabaseManager, get_db_config

async def main():
    # Get database configuration from environment/compose file
    config = get_db_config()

    # Create database manager
    db_manager = DatabaseManager(
        host=config["host"],
        port=int(config["port"]),
        database=config["database"],
        user=config["user"],
        password=config["password"]
    )

    # Initialize schema (creates tables if needed)
    await db_manager.initialize_schema()

    # Get database statistics
    stats = await db_manager.get_stats()
    print(f"Documents: {stats['documents']}")
    print(f"Chunks: {stats['chunks']}")
    print(f"Embedding models: {stats['embedding_models']}")

    # Get RAG settings
    settings = await db_manager.get_rag_settings()
    if settings:
        print(f"Max chunks: {settings['max_chunks']}")

    # Update RAG settings
    await db_manager.update_rag_settings(
        enable_refinement=True,
        enable_reranking=True,
        similarity_threshold=0.7,
        max_chunks=10
    )

    # Generate manifest of all source files
    await db_manager.generate_manifest("manifest.txt")

asyncio.run(main())
```

### Response
#### Success Response
- **stats** (dict) - Dictionary containing database statistics.
- **settings** (dict or None) - Dictionary containing RAG settings, or None if not configured.
- **`generate_manifest`** returns None upon successful completion.
```

--------------------------------

### Pull Ollama Model

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command pulls a specific model, 'qwen2.5:3b-instruct', from Ollama. It's a troubleshooting step for 'Model not found' errors when using Ollama.

```bash
ollama pull qwen2.5:3b-instruct
```

--------------------------------

### Database Operations: Stop, Logs, Destroy

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Provides functions to manage the database, including stopping it to preserve data, viewing logs (with an option for follow mode), and destroying the database and all associated data.

```python
success = stop_database()

# View logs
success = get_database_logs()
success = get_database_logs(follow=True)  # Follow mode

# Destroy database and all data
success = destroy_database()
```

--------------------------------

### Docs2DB CLI Ingest and Processing Commands

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/README.md

Core commands for ingesting and processing documents with docs2db. Each step generates intermediate files in the `docs2db_content/` directory.

```bash
docs2db ingest      # Ingest documents
docs2db chunk       # Generate chunks
docs2db embed       # Generate embeddings
docs2db load        # Load into database
docs2db db-dump     # Create SQL dump
docs2db db-restore  # Restore from dump
docs2db audit       # Check content directory
```

--------------------------------

### Python Library: ingest_from_content

Source: https://context7.com/rhel-lightspeed/docs2db/llms.txt

Converts in-memory content (HTML, markdown, etc.) directly to Docling JSON without requiring intermediate files.

```APIDOC
## Python Library: ingest_from_content

### Description
Ingests content directly from memory (e.g., strings) into Docling JSON format, useful for dynamic or API-fetched content.

### Method
Python Function

### Endpoint
N/A

### Parameters
#### Arguments
- **content** (str or bytes) - Required - The content to ingest.
- **content_path** (Path) - Required - The directory where the Docling JSON and metadata will be stored.
- **stream_name** (str) - Required - The name of the stream, including the file extension, which helps in format detection (e.g., `"document.html"`, `"report.md"`).
- **source_metadata** (dict) - Optional - Metadata about the source of the content.
- **content_encoding** (str) - Optional - The encoding of the content if it's provided as bytes (e.g., `"utf-16"`).

### Request Example
```python
from pathlib import Path
from docs2db.ingest import ingest_from_content

# Ingest HTML content from memory
# (the HTML markup was lost in extraction; these tags are a minimal stand-in)
html_content = """
<html>
<head><title>API Documentation</title></head>
<body>
<h1>Getting Started</h1>
<p>Welcome to our API documentation...</p>
</body>
</html>
"""

success = ingest_from_content(
    content=html_content,
    content_path=Path("docs2db_content/api_docs/getting_started"),
    stream_name="getting_started.html",  # Extension determines format detection
    source_metadata={
        "source_url": "https://api.example.com/docs/getting-started",
        "retrieved_at": "2024-01-15T10:30:00Z"
    }
)

# Ingest markdown content
md_content = "# User Guide\n\nThis guide covers..."
success = ingest_from_content(
    content=md_content,
    content_path=Path("docs2db_content/guides/user_guide"),
    stream_name="user_guide.md"
)

# Ingest with custom encoding
success = ingest_from_content(
    content=html_content.encode("utf-16"),
    content_path=Path("docs2db_content/legacy/doc"),
    stream_name="doc.html",
    content_encoding="utf-16"
)
```

### Response
#### Success Response (boolean)
- **True** if ingestion was successful.
- **False** if ingestion failed.

#### Response Example
```
True
```
```

--------------------------------

### Use Cloud Model for Production Quality

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

This command configures docs2db chunk for production with a quality priority by using a capable cloud model, 'gpt-4o-mini', and specifies the LLM base URL.

```bash
uv run docs2db chunk \
  --llm-base-url "https://api.openai.com" \
  --context-model "gpt-4o-mini"
```

--------------------------------

### Configure Docs2DB Chunking with Faster Local Ollama Models

Source: https://github.com/rhel-lightspeed/docs2db/blob/main/docs/LLM_PROVIDERS.md

These commands demonstrate how to configure docs2db to use faster, smaller local models with Ollama. You can specify different model sizes like 3B, 1.5B, or alternative fast models like Llama 3.2 or Gemma 2. A custom Ollama URL can also be provided.
```bash
# 3B model (2-3x faster)
uv run docs2db chunk --context-model qwen2.5:3b-instruct

# 1.5B model (4-5x faster, may be lower quality)
uv run docs2db chunk --context-model qwen2.5:1.5b-instruct

# Alternative fast models
uv run docs2db chunk --context-model llama3.2:3b-instruct
uv run docs2db chunk --context-model gemma2:2b-instruct

# Custom Ollama URL
uv run docs2db chunk --openai-url "http://localhost:11434" --context-model qwen2.5:7b-instruct
```
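
--------------------------------

### Assemble a `docs2db chunk` Command from Python (Sketch)

When the chunking step is driven from a script, it can help to build the command line in one place so the model and URL are easy to swap. The following is a minimal sketch, not part of docs2db: `build_chunk_command` is a hypothetical helper, and it uses only the `--context-model`, `--openai-url`, and `--skip-context` flags shown in the snippets above. It assumes that skipping context generation makes the model flags unnecessary.

```python
import shlex


def build_chunk_command(context_model=None, openai_url=None, skip_context=False):
    """Hypothetical helper: return the argv list for `uv run docs2db chunk`."""
    cmd = ["uv", "run", "docs2db", "chunk"]
    if skip_context:
        # Assumption: with --skip-context no LLM is called, so model flags are omitted
        cmd.append("--skip-context")
        return cmd
    if openai_url:
        # Ollama is addressed through the OpenAI-compatible URL flag
        cmd += ["--openai-url", openai_url]
    if context_model:
        cmd += ["--context-model", context_model]
    return cmd


cmd = build_chunk_command(
    context_model="qwen2.5:3b-instruct",
    openai_url="http://localhost:11434",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute the step; building a list instead of a shell string avoids quoting problems with model names and URLs.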