### Install Dependencies

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

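Installs the packages used by the Milvus example: Docling, Haystack, the Milvus integration (`pymilvus`, `milvus-haystack`), `sentence-transformers`, and `python-dotenv`. Restart the kernel if prompted.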

```bash
%pip install -q --progress-bar off docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv
```

--------------------------------

### Install Docling and Haystack Packages

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Install the necessary packages for Docling and Haystack. A kernel restart might be required after installation.

```bash
%pip install -q --progress-bar off docling-haystack haystack-ai docling python-dotenv
```


--------------------------------

### Install Project Dependencies with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Installs all the dependencies listed in your pyproject.toml file. Ensure the virtual environment is activated first.

```bash
poetry install
```

--------------------------------

### Install Poetry Globally

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

This command installs Poetry globally on your machine using the official installer script. Ensure you note the installation bin folder for the next steps.

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

--------------------------------

### License Header Example

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Include this header at the beginning of each source file to comply with the MIT license. The `SPDX-License-Identifier` tag identifies the license in SPDX format.

```text
/*
Copyright IBM Inc. All rights reserved.

SPDX-License-Identifier: MIT
*/
```
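
A check along the following lines could confirm that a file carries the header. This is a minimal sketch, not part of the project's tooling: the helper name `has_spdx_header` and the five-line search window are assumptions made for illustration.

```python
import re

# Matches an SPDX tag such as "SPDX-License-Identifier: MIT".
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def has_spdx_header(source: str, expected: str = "MIT", max_lines: int = 5) -> bool:
    """Return True if one of the first `max_lines` lines declares `expected`."""
    for line in source.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            return match.group(1) == expected
    return False


header = "/*\nCopyright IBM Inc. All rights reserved.\n\nSPDX-License-Identifier: MIT\n*/\n"
print(has_spdx_header(header))            # True
print(has_spdx_header("print('hello')"))  # False
```

Limiting the search to the first few lines keeps the check tied to the "beginning of each source file" requirement rather than matching a tag that appears deeper in the file.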

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Installs the git hooks managed by pre-commit. These hooks run automatically on commits to enforce coding standards.

```bash
pre-commit install
```

--------------------------------

### Add a New Dependency with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Adds a new package to your project's dependencies and installs it. Replace `NAME` with the actual package name.

```bash
poetry add NAME
```

--------------------------------

### Convert Single PDF to Markdown

Source: https://github.com/docling-project/docling-haystack/blob/main/test/data/2408.09869v5.md

Basic usage example for converting a single PDF document from a URL or file path to Markdown format. Ensure the 'docling' package is installed.

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
converter = DocumentConverter()
# Docling 2.x API (the 1.x calls were convert_single() and render_as_markdown())
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"
```

--------------------------------

### Add Poetry to Bash Path

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Run this command to append the `export` line to your `~/.bashrc` file so that Poetry is accessible in your Bash shell. Replace `POETRY_BIN` with the bin folder noted during installation.

```bash
echo 'export PATH="POETRY_BIN:$PATH"' >> ~/.bashrc
```

--------------------------------

### Add Poetry to Zsh Path

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Run this command to append the `export` line to your `~/.zshrc` file so that Poetry is accessible in your Zsh shell. Replace `POETRY_BIN` with the bin folder noted during installation.

```bash
echo 'export PATH="POETRY_BIN:$PATH"' >> ~/.zshrc
```

--------------------------------

### DCO Sign-off Example

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Developers must include a sign-off statement in their commit messages to indicate acceptance of the Developer's Certificate of Origin (DCO).

```text
Signed-off-by: John Doe <john.doe@example.com>
```
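
To illustrate the expected trailer format, the sketch below extracts sign-off lines from a commit message. It is an assumption-laden example rather than the project's actual DCO check: `find_signoffs` and the regular expression are hypothetical and only cover the `Name <email>` shape shown above.

```python
import re

# A sign-off trailer: "Signed-off-by: Name <email>" on a line of its own.
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>[^<>\n]+) <(?P<email>[^<>\s]+@[^<>\s]+)>$",
    re.MULTILINE,
)


def find_signoffs(message: str) -> list[tuple[str, str]]:
    """Return (name, email) pairs for every sign-off trailer in `message`."""
    return [(m.group("name").strip(), m.group("email")) for m in SIGNOFF_RE.finditer(message)]


message = "Fix typo in README\n\nSigned-off-by: John Doe <john.doe@example.com>\n"
print(find_signoffs(message))        # [('John Doe', 'john.doe@example.com')]
print(find_signoffs("Fix typo\n"))   # []
```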

--------------------------------

### Initialize Document Store and Indexing Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Set up an in-memory document store and create a pipeline to convert documents using DoclingConverter and write them to the store.

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from docling_haystack.converter import DoclingConverter

document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
```

--------------------------------

### Configuration and Constants

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Sets up configuration variables including API keys, document paths, model IDs, and Milvus connection details.

```python
import os
from pathlib import Path
from tempfile import mkdtemp

from docling_haystack.converter import ExportType

HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = [
    "https://arxiv.org/pdf/2408.09869",  # Docling Technical Report
    # ... additional docs can be listed here
]
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
```

--------------------------------

### Initialize Document Store and Indexing Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Sets up the Milvus document store and a Haystack pipeline for converting, embedding, and writing documents. Handles different export types and document splitting.

```python
from docling.chunking import HybridChunker
from haystack import Pipeline
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever

from docling_haystack.converter import DoclingConverter

document_store = MilvusDocumentStore(
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
    text_field="txt",  # set for preventing conflict with same-name metadata field
)

idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    idx_pipe.connect("converter", "embedder")
elif EXPORT_TYPE == ExportType.MARKDOWN:
    idx_pipe.add_component(
        "splitter",
        DocumentSplitter(split_by="sentence", split_length=1),
    )
    idx_pipe.connect("converter.documents", "splitter.documents")
    idx_pipe.connect("splitter.documents", "embedder.documents")
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
idx_pipe.connect("embedder", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
```
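
The branching above decides how the components are wired for each export type. The stand-alone sketch below mirrors that decision so the two resulting graphs can be compared without installing Haystack or Docling. `plan_connections` is a hypothetical helper, and the `ExportType` enum is a local stand-in whose string values are assumed.

```python
from enum import Enum


class ExportType(str, Enum):
    """Local stand-in for docling_haystack.converter.ExportType (values assumed)."""

    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"


def plan_connections(export_type: ExportType) -> list[tuple[str, str]]:
    """Return the (sender, receiver) pairs the indexing pipeline would connect."""
    if export_type == ExportType.DOC_CHUNKS:
        # Chunks come straight from the converter, so no splitter is needed.
        edges = [("converter", "embedder")]
    elif export_type == ExportType.MARKDOWN:
        # One Markdown document per file is split before embedding.
        edges = [
            ("converter.documents", "splitter.documents"),
            ("splitter.documents", "embedder.documents"),
        ]
    else:
        raise ValueError(f"Unexpected export type: {export_type}")
    edges.append(("embedder", "writer"))
    return edges


print(plan_connections(ExportType.DOC_CHUNKS))
# [('converter', 'embedder'), ('embedder', 'writer')]
print(len(plan_connections(ExportType.MARKDOWN)))  # 3
```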

--------------------------------

### Configure API Key and Document Paths

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Set up the Hugging Face API key and define the list of document paths for processing.

```python
import os

HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = [
    "https://arxiv.org/pdf/2408.09869",  # Docling Technical Report
    # ... additional docs can be listed here
]
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
```
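
Because `HF_API_KEY` is read with an empty-string default, the pipeline snippets only pass a token when the value is non-empty. The sketch below shows one way to make that "unset or blank means no token" rule explicit. `read_token` is a hypothetical helper, not part of Haystack or Docling.

```python
import os
from typing import Mapping, Optional


def read_token(env: Mapping[str, str] = os.environ, name: str = "HF_API_KEY") -> Optional[str]:
    """Return the token stored under `name`, or None when it is unset or blank."""
    value = env.get(name, "").strip()
    return value or None


print(read_token({"HF_API_KEY": "hf_example"}))  # hf_example
print(read_token({}))                            # None
print(read_token({"HF_API_KEY": "   "}))         # None
```

Passing the environment as a mapping keeps the helper easy to test without modifying the real process environment.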

--------------------------------

### Basic DoclingConverter Usage

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Demonstrates basic usage of DoclingConverter with default settings (DOC_CHUNKS mode) to convert documents from URLs and local paths into a Haystack document store.

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from docling_haystack.converter import DoclingConverter

# Basic usage with default settings (DOC_CHUNKS mode)
document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")

# Convert documents from URLs or local paths
result = idx_pipe.run({
    "converter": {
        "paths": [
            "https://arxiv.org/pdf/2408.09869",  # URL
            "/path/to/local/document.pdf",        # Local path
        ]
    }
})
# Output: {'writer': {'documents_written': 54}}

# Access converted documents
for doc in document_store.filter_documents():
    print(f"Content: {doc.content[:100]}...")
    print(f"Metadata: {doc.meta}")
```

--------------------------------

### Use a Specific Python Version with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Creates a new virtual environment using a specified Python interpreter. Replace `$(which python3.9)` with the desired Python executable path.

```bash
poetry env use $(which python3.9)
```

--------------------------------

### RAG Pipeline with InMemoryBM25Retriever

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Set up and run a RAG pipeline using an in-memory document store and BM25 retriever, then print the answer with its sources. The snippet relies on the imports, configuration constants, and populated `document_store` from "Build RAG Pipeline with DoclingConverter".

```python
prompt_template = """
Given these documents, answer the question.
Documents:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
Question: {{query}}
Answer:
"""

rag_pipe = Pipeline()
rag_pipe.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")

# Run query
rag_res = rag_pipe.run({
    "retriever": {"query": QUESTION},
    "prompt_builder": {"query": QUESTION},
    "answer_builder": {"query": QUESTION},
})

# Extract answer and sources with rich metadata
print(f"Question: {QUESTION}")
print(f"Answer: {rag_res['answer_builder']['answers'][0].data.strip()}")
print("\nSources:")
for source in rag_res["answer_builder"]["answers"][0].documents:
    doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
    print(f"- Text: {repr(doc_chunk.text[:100])}...")
    if doc_chunk.meta.origin:
        print(f"  File: {doc_chunk.meta.origin.filename}")
    if doc_chunk.meta.headings:
        print(f"  Section: {' / '.join(doc_chunk.meta.headings)}")
    bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
    print(f"  Page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
          f"BBox: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]")
```

--------------------------------

### Initialize RAG Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Constructs a Retrieval-Augmented Generation (RAG) pipeline using Haystack components for embedding, retrieval, prompt building, and generation.

```python
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "embedder",
    SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),
)
rag_pipe.add_component(
    "retriever",
    MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("embedder.embedding", "retriever")
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
    {
        "embedder": {"text": QUESTION},
        "prompt_builder": {"query": QUESTION},
        "answer_builder": {"query": QUESTION},
    }
)
```

--------------------------------

### Build RAG Pipeline with Semantic Search

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Builds and runs a RAG pipeline that embeds the question, retrieves matching chunks from Milvus, builds a prompt, and generates an answer. The snippet relies on the imports, configuration constants, and populated `document_store` from "RAG Pipeline with Milvus and Sentence Transformers".

```python
prompt_template = """
Given these documents, answer the question.
Documents:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
Question: {{query}}
Answer:
"""

rag_pipe = Pipeline()
rag_pipe.add_component(
    "embedder",
    SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),
)
rag_pipe.add_component(
    "retriever",
    MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("embedder.embedding", "retriever")
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")

# Run semantic search query
rag_res = rag_pipe.run({
    "embedder": {"text": QUESTION},
    "prompt_builder": {"query": QUESTION},
    "answer_builder": {"query": QUESTION},
})

print(f"Answer: {rag_res['answer_builder']['answers'][0].data.strip()}")
```

--------------------------------

### Custom Chunker Configuration with HybridChunker

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Shows how to configure DoclingConverter with a custom chunker, specifically HybridChunker, and how to specify a tokenizer to match the embedding model. The resulting chunks contain rich metadata for grounding.

```python
from docling.chunking import HybridChunker
from docling_haystack.converter import DoclingConverter, ExportType

# Configure chunker with specific tokenizer (should match your embedding model)
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

result = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])
documents = result["documents"]

# Each chunk contains rich metadata for grounding
for doc in documents[:3]:
    print(f"Chunk: {doc.content[:100]}...")
    print(f"Metadata keys: {doc.meta.keys()}")
    # Metadata includes: dl_meta with origin, headings, page numbers, bounding boxes
```

--------------------------------

### Build RAG Pipeline with DoclingConverter

Source: https://context7.com/docling-project/docling-haystack/llms.txt

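Sets up the first half of a RAG workflow: imports, configuration, and an indexing pipeline that uses DoclingConverter to ingest documents into an InMemoryDocumentStore. BM25 retrieval and generation are covered in "RAG Pipeline with InMemoryBM25Retriever".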

```python
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret
from docling.chunking import DocChunk

from docling_haystack.converter import DoclingConverter

load_dotenv()

# Configuration
HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = ["https://arxiv.org/pdf/2408.09869"]
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3

# Indexing Pipeline
document_store = InMemoryDocumentStore()
idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})

```

--------------------------------

### Activate Poetry Shell

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Use this command to activate the virtual environment managed by Poetry. If the environment does not exist, Poetry will create it. On Poetry 2.x, `poetry shell` is no longer built in and requires the `poetry-plugin-shell` plugin.

```bash
poetry shell
```

--------------------------------

### RAG Pipeline with Milvus and Sentence Transformers

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Index documents for a RAG pipeline using Milvus for document storage and Sentence Transformers for embeddings. The snippet covers document conversion, embedding, and writing to the Milvus store. Here `MILVUS_URI` points to a local database file in a temporary directory.

```python
import os
from pathlib import Path
from tempfile import mkdtemp
from dotenv import load_dotenv
from docling.chunking import HybridChunker
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever

from docling_haystack.converter import DoclingConverter, ExportType

load_dotenv()

# Configuration
HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = ["https://arxiv.org/pdf/2408.09869"]
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")

# Setup Milvus document store
document_store = MilvusDocumentStore(
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
    text_field="txt",  # Prevent conflict with metadata field named 'content'
)

# Indexing Pipeline with embeddings
idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=ExportType.DOC_CHUNKS,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "embedder")
idx_pipe.connect("embedder", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
# Output: {'writer': {'documents_written': 54}}
```

--------------------------------

### Build and Run RAG Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Construct a Retrieval-Augmented Generation (RAG) pipeline using BM25Retriever, PromptBuilder, HuggingFaceAPIGenerator, and AnswerBuilder.

```python
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.utils import Secret

prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
    {
        "retriever": {"query": QUESTION},
        "prompt_builder": {"query": QUESTION},
        "answer_builder": {"query": QUESTION},
    }
)
```

--------------------------------

### Display Question, Answer, and Sources

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Print the question, the generated answer, and the sources used to generate the answer, extracting metadata from DocChunk.

```python
from docling.chunking import DocChunk

print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
    doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
    print(f"- text: {repr(doc_chunk.text)}")
    if doc_chunk.meta.origin:
        print(f"  file: {doc_chunk.meta.origin.filename}")
    if doc_chunk.meta.headings:
        print(f"  section: {' / '.join(doc_chunk.meta.headings)}")
    bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
    print(
        f"  page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
        f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]")
```
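
The loop above indexes `doc_items[0].prov[0]` directly, which would raise an `IndexError` for a chunk without items or provenance. The sketch below shows a defensive variant that works on a plain dictionary. The `sample` dictionary is hand-written and mirrors only the attributes accessed above; its shape is an assumption, and `describe_source` is a hypothetical helper.

```python
from typing import Any


def describe_source(dl_meta: dict[str, Any]) -> list[str]:
    """Format one source for display, skipping fields that are missing."""
    meta = dl_meta.get("meta", {})
    lines = [f"- text: {dl_meta.get('text', '')!r}"]
    origin = meta.get("origin") or {}
    if origin.get("filename"):
        lines.append(f"  file: {origin['filename']}")
    if meta.get("headings"):
        lines.append(f"  section: {' / '.join(meta['headings'])}")
    # Guard against chunks without items or provenance instead of indexing [0] blindly.
    items = meta.get("doc_items") or []
    prov = (items[0].get("prov") or []) if items else []
    if prov:
        bbox = prov[0]["bbox"]
        box = [int(bbox[key]) for key in ("l", "t", "r", "b")]
        lines.append(f"  page: {prov[0]['page_no']}, bounding box: {box}")
    return lines


sample = {
    "text": "Docling converts PDF documents.",
    "meta": {
        "origin": {"filename": "report.pdf"},
        "headings": ["1 Introduction"],
        "doc_items": [
            {"prov": [{"page_no": 1, "bbox": {"l": 10.2, "t": 700.9, "r": 500.0, "b": 650.4}}]}
        ],
    },
}
print("\n".join(describe_source(sample)))
print("\n".join(describe_source({"text": "no provenance", "meta": {"doc_items": []}})))
```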

--------------------------------

### Load Environment Variables

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Load environment variables, particularly the Hugging Face API key, using python-dotenv.

```python
from dotenv import load_dotenv

_ = load_dotenv()
```

--------------------------------

### Configure Custom DocumentConverter for PDF

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Set up a DocumentConverter with specific PDF format options, such as enabling OCR. This allows for customized document processing before ingestion.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_haystack.converter import DoclingConverter

# Configure custom DocumentConverter with specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # Enable OCR for scanned documents

custom_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Use custom converter in DoclingConverter
docling_converter = DoclingConverter(
    converter=custom_converter,
    convert_kwargs={},
)

result = docling_converter.run(paths=["scanned_document.pdf"])
print(f"Converted {len(result['documents'])} documents with OCR")
```

--------------------------------

### Run Pre-commit Checks On-Demand

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Manually runs all configured pre-commit checks on all files in the repository. This is useful for checking code style before committing.

```bash
pre-commit run --all-files
```

--------------------------------

### DoclingConverter Usage with Haystack

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Imports the `DoclingConverter` component from the `docling_haystack` package.

```python
from docling_haystack.converter import DoclingConverter
```

--------------------------------

### ExportType Configuration for DoclingConverter

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Illustrates how to configure the ExportType for DoclingConverter. DOC_CHUNKS mode creates multiple chunks per document, while MARKDOWN mode creates a single document per input file with Markdown content.

```python
from docling_haystack.converter import DoclingConverter, ExportType

# DOC_CHUNKS mode (default) - creates multiple chunks per document
chunked_converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
)
result = chunked_converter.run(paths=["document.pdf"])
# Returns multiple Document objects, one per chunk
print(f"Number of chunks: {len(result['documents'])}")  # e.g., 54 chunks

# MARKDOWN mode - creates one document per input file
markdown_converter = DoclingConverter(
    export_type=ExportType.MARKDOWN,
    md_export_kwargs={"image_placeholder": ""},  # Customize markdown export
)
result = markdown_converter.run(paths=["document.pdf"])
# Returns one Document object containing full markdown
print(f"Number of documents: {len(result['documents'])}")  # 1 document
print(f"Full markdown content: {result['documents'][0].content[:500]}...")
```

--------------------------------

### Implement Custom MetaExtractor

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Create a custom metadata extractor by subclassing BaseMetaExtractor to define how metadata is captured from document chunks and the full document. This is useful for extracting application-specific fields.

```python
from typing import Any
from docling.chunking import BaseChunk
from docling.datamodel.document import DoclingDocument
from docling_haystack.converter import (
    DoclingConverter,
    BaseMetaExtractor,
    ExportType,
)

class CustomMetaExtractor(BaseMetaExtractor):
    """Extract custom metadata from documents and chunks."""

    def extract_chunk_meta(self, chunk: BaseChunk) -> dict[str, Any]:
        """Extract metadata from a document chunk."""
        meta = {"dl_meta": chunk.export_json_dict()}
        # Add custom fields
        if hasattr(chunk, 'meta') and chunk.meta.headings:
            meta["section"] = " / ".join(chunk.meta.headings)
        return meta

    def extract_dl_doc_meta(self, dl_doc: DoclingDocument) -> dict[str, Any]:
        """Extract metadata from the full document."""
        meta = {}
        if dl_doc.origin:
            meta["dl_meta"] = {"origin": dl_doc.origin.model_dump(exclude_none=True)}
            meta["filename"] = dl_doc.origin.filename
        return meta

# Use custom meta extractor
converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
    meta_extractor=CustomMetaExtractor(),
)

result = converter.run(paths=["document.pdf"])
for doc in result["documents"]:
    print(f"Section: {doc.meta.get('section', 'N/A')}")
```

--------------------------------

### Process and Display Document Chunks with Docling

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Use this code to print question, answer, and source details from a Docling-Haystack result. It iterates through sources, extracting and formatting metadata like text, filename, section, page number, and bounding box based on the export type.

```python
from docling.chunking import DocChunk

print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
    if EXPORT_TYPE == ExportType.DOC_CHUNKS:
        doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
        print(f"- text: {repr(doc_chunk.text)}")
        if doc_chunk.meta.origin:
            print(f"  file: {doc_chunk.meta.origin.filename}")
        if doc_chunk.meta.headings:
            print(f"  section: {' / '.join(doc_chunk.meta.headings)}")
        bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
        print(
            f"  page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
            f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]")
    elif EXPORT_TYPE == ExportType.MARKDOWN:
        print(repr(source.content))
    else:
        raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
```

--------------------------------

### Python Callable Interface for Model Implementations

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Model implementations must satisfy the Python Callable interface. The __call__ method should accept an iterator of page objects and return an iterator of augmented page objects.

```python
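from collections.abc import Iterator

from docling.datamodel.base_models import Page


class BaseModelPipeline:
    ...

    # Takes a stream of pages and yields the same pages with predictions attached.
    def __call__(self, pages: Iterator[Page]) -> Iterator[Page]:
        ...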
```
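
To make the interface concrete, the sketch below chains two toy models that each consume and yield a stream of pages. It is illustrative only: `Page` is a simplified stand-in for Docling's page object, and `TagModel` and `run_models` are hypothetical names that do not exist in Docling.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class Page:
    """Simplified stand-in for a page object (assumed shape)."""

    page_no: int
    predictions: dict = field(default_factory=dict)


class TagModel:
    """Toy model: records a fixed value under a fixed key on every page."""

    def __init__(self, key: str, value: str) -> None:
        self.key = key
        self.value = value

    def __call__(self, pages: Iterator[Page]) -> Iterator[Page]:
        for page in pages:
            page.predictions[self.key] = self.value
            yield page  # lazily pass the augmented page downstream


PageModel = Callable[[Iterator[Page]], Iterator[Page]]


def run_models(pages: Iterable[Page], models: list[PageModel]) -> list[Page]:
    """Feed the page stream through each model in order and collect the result."""
    stream = iter(pages)
    for model in models:
        stream = model(stream)
    return list(stream)


result = run_models([Page(1), Page(2)], [TagModel("layout", "text"), TagModel("ocr", "skipped")])
print([page.predictions for page in result])
# [{'layout': 'text', 'ocr': 'skipped'}, {'layout': 'text', 'ocr': 'skipped'}]
```

Because every model accepts and returns an iterator, stages can be composed in any order and pages flow through one at a time instead of being materialized between stages.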

--------------------------------

### Git Commit with DCO

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Use this command to automatically include the DCO sign-off line in your commit message when committing changes to your local git repository.

```bash
git commit -s
```
