### Install Dependencies

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Installs necessary packages for Docling and Haystack. Restart the kernel if prompted.

```bash
%pip install -q --progress-bar off docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv
```

--------------------------------

### Install Docling and Haystack Packages

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Install the necessary packages for Docling and Haystack. A kernel restart might be required after installation.

```bash
%pip install -q --progress-bar off docling-haystack haystack-ai docling python-dotenv
```

```bash
%pip install -q --progress-bar off haystack-ai docling python-dotenv
```

--------------------------------

### Install Project Dependencies with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Installs all the dependencies listed in your `pyproject.toml` file. Ensure the virtual environment is activated first.

```bash
poetry install
```

--------------------------------

### Install Poetry Globally

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

This command installs Poetry globally on your machine using the official installer script. Ensure you note the installation bin folder for the next steps.

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

--------------------------------

### License Header Example

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Include this header at the beginning of each source file to comply with the MIT Software license. It uses the SPDX format for identification.

```text
/* Copyright IBM Inc. All rights reserved.
SPDX-License-Identifier: MIT */
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Installs the git hooks managed by pre-commit. These hooks run automatically on commits to enforce coding standards.

```bash
pre-commit install
```

--------------------------------

### Add a New Dependency with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Adds a new package to your project's dependencies and installs it. Replace `NAME` with the actual package name.

```bash
poetry add NAME
```

--------------------------------

### Convert Single PDF to Markdown

Source: https://github.com/docling-project/docling-haystack/blob/main/test/data/2408.09869v5.md

Basic usage example for converting a single PDF document from a URL or file path to Markdown format. Ensure the 'docling' package is installed. The snippet reproduces the API as printed in the Docling technical report; in Docling 2.x the equivalent calls are `converter.convert(source)` and `result.document.export_to_markdown()`.

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown())
# output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"
```

--------------------------------

### Add Poetry to Bash Path

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Append this line to your ~/.bashrc file to ensure Poetry is accessible in your Bash shell. Replace POETRY_BIN with the actual installation bin folder.

```bash
echo 'export PATH="POETRY_BIN:$PATH"' >> ~/.bashrc
```

--------------------------------

### Add Poetry to Zsh Path

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Append this line to your ~/.zshrc file to ensure Poetry is accessible in your Zsh shell. Replace POETRY_BIN with the actual installation bin folder.
```bash
echo 'export PATH="POETRY_BIN:$PATH"' >> ~/.zshrc
```

--------------------------------

### DCO Sign-off Example

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Developers must include a sign-off statement in their commit messages to indicate acceptance of the Developer's Certificate of Origin (DCO).

```text
Signed-off-by: John Doe
```

--------------------------------

### Initialize Document Store and Indexing Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Set up an in-memory document store and create a pipeline to convert documents using DoclingConverter and write them to the store.

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from docling_haystack.converter import DoclingConverter

document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
```

--------------------------------

### Configuration and Constants

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Sets up configuration variables including API keys, document paths, model IDs, and Milvus connection details.

```python
import os
from pathlib import Path
from tempfile import mkdtemp

from docling_haystack.converter import ExportType

HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = [
    "https://arxiv.org/pdf/2408.09869",  # Docling Technical Report
    # ... additional docs can be listed here
]
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
```

--------------------------------

### Initialize Document Store and Indexing Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Sets up the Milvus document store and a Haystack pipeline for converting, embedding, and writing documents. Handles different export types and document splitting.

```python
from docling.chunking import HybridChunker
from haystack import Pipeline
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever

from docling_haystack.converter import DoclingConverter

document_store = MilvusDocumentStore(
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
    text_field="txt",  # set for preventing conflict with same-name metadata field
)

idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    idx_pipe.connect("converter", "embedder")
elif EXPORT_TYPE == ExportType.MARKDOWN:
    idx_pipe.add_component(
        "splitter",
        DocumentSplitter(split_by="sentence", split_length=1),
    )
    idx_pipe.connect("converter.documents", "splitter.documents")
    idx_pipe.connect("splitter.documents", "embedder.documents")
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
idx_pipe.connect("embedder", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
```

--------------------------------

### Configure API Key and Document Paths

Source:
https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Set up the Hugging Face API key and define the list of document paths for processing.

```python
import os

HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = [
    "https://arxiv.org/pdf/2408.09869",  # Docling Technical Report
    # ... additional docs can be listed here
]
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
```

--------------------------------

### Basic DoclingConverter Usage

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Demonstrates basic usage of DoclingConverter with default settings (DOC_CHUNKS mode) to convert documents from URLs and local paths into a Haystack document store.

```python
from pathlib import Path

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from docling_haystack.converter import DoclingConverter, ExportType

# Basic usage with default settings (DOC_CHUNKS mode)
document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")

# Convert documents from URLs or local paths
result = idx_pipe.run({
    "converter": {
        "paths": [
            "https://arxiv.org/pdf/2408.09869",  # URL
            "/path/to/local/document.pdf",  # Local path
        ]
    }
})
# Output: {'writer': {'documents_written': 54}}

# Access converted documents
for doc in document_store.filter_documents():
    print(f"Content: {doc.content[:100]}...")
    print(f"Metadata: {doc.meta}")
```

--------------------------------

### Use a Specific Python Version with Poetry

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Creates a new
virtual environment using a specified Python interpreter. Replace `$(which python3.9)` with the desired Python executable path.

```bash
poetry env use $(which python3.9)
```

--------------------------------

### RAG Pipeline with InMemoryBM25Retriever

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Set up a RAG pipeline using an in-memory document store and BM25 retriever. Ensure necessary components like Pipeline, PromptBuilder, HuggingFaceAPIGenerator, and AnswerBuilder are imported.

```python
prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")

# Run query
rag_res = rag_pipe.run({
    "retriever": {"query": QUESTION},
    "prompt_builder": {"query": QUESTION},
    "answer_builder": {"query": QUESTION},
})

# Extract answer and sources with rich metadata
print(f"Question: {QUESTION}")
print(f"Answer: {rag_res['answer_builder']['answers'][0].data.strip()}")
print("\nSources:")
for source in rag_res["answer_builder"]["answers"][0].documents:
    doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
    print(f"- Text: {repr(doc_chunk.text[:100])}...")
    if doc_chunk.meta.origin:
        print(f"  File: {doc_chunk.meta.origin.filename}")
    if doc_chunk.meta.headings:
        print(f"  Section: {' / '.join(doc_chunk.meta.headings)}")
    bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
    print(f"  Page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
          f"BBox: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]")
```

--------------------------------

### Initialize RAG Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Constructs a Retrieval-Augmented Generation (RAG) pipeline using Haystack components for embedding, retrieval, prompt building, and generation.

```python
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "embedder",
    SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),
)
rag_pipe.add_component(
    "retriever",
    MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("embedder.embedding", "retriever")
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
    {
        "embedder": {"text": QUESTION},
        "prompt_builder": {"query": QUESTION},
        "answer_builder": {"query": QUESTION},
    }
)
```

--------------------------------

### Build RAG Pipeline with Semantic Search

Source: https://context7.com/docling-project/docling-haystack/llms.txt

This snippet demonstrates setting up a RAG pipeline in Haystack, including components for embedding, retrieval, prompt building, generation, and answer construction. It connects these components to form a functional pipeline for semantic search and question answering.

```python
prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "embedder",
    SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),
)
rag_pipe.add_component(
    "retriever",
    MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("embedder.embedding", "retriever")
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")

# Run semantic search query
rag_res = rag_pipe.run({
    "embedder": {"text": QUESTION},
    "prompt_builder": {"query": QUESTION},
    "answer_builder": {"query": QUESTION},
})
print(f"Answer: {rag_res['answer_builder']['answers'][0].data.strip()}")
```

--------------------------------

### Custom Chunker Configuration with HybridChunker

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Shows how to configure DoclingConverter with a custom chunker, specifically HybridChunker, and how to specify a
tokenizer to match the embedding model. The resulting chunks contain rich metadata for grounding.

```python
from docling.chunking import HybridChunker

from docling_haystack.converter import DoclingConverter, ExportType

# Configure chunker with specific tokenizer (should match your embedding model)
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

result = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])
documents = result["documents"]

# Each chunk contains rich metadata for grounding
for doc in documents[:3]:
    print(f"Chunk: {doc.content[:100]}...")
    print(f"Metadata keys: {doc.meta.keys()}")
    # Metadata includes: dl_meta with origin, headings, page numbers, bounding boxes
```

--------------------------------

### Build RAG Pipeline with DoclingConverter

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Demonstrates a complete RAG pipeline using DoclingConverter for document ingestion into an InMemoryDocumentStore, followed by BM25 retrieval and generation.

```python
import os

from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret

from docling.chunking import DocChunk
from docling_haystack.converter import DoclingConverter

load_dotenv()

# Configuration
HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = ["https://arxiv.org/pdf/2408.09869"]
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3

# Indexing Pipeline
document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
```

--------------------------------

### Activate Poetry Shell

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Use this command to activate the virtual environment managed by Poetry. If the environment does not exist, Poetry will create it.

```bash
poetry shell
```

--------------------------------

### RAG Pipeline with Milvus and Sentence Transformers

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Configure a RAG pipeline that uses Milvus for document storage and Sentence Transformers for embeddings. This pipeline includes document conversion, embedding, and writing to the Milvus store, which in this example is a local database file in a temporary directory.

```python
import os
from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from docling.chunking import HybridChunker
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever

from docling_haystack.converter import DoclingConverter, ExportType

load_dotenv()

# Configuration
HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = ["https://arxiv.org/pdf/2408.09869"]
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")

# Setup Milvus document store
document_store = MilvusDocumentStore(
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
    text_field="txt",  # Avoid a clash with a same-name metadata field
)

# Indexing Pipeline with embeddings
idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=ExportType.DOC_CHUNKS,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "embedder")
idx_pipe.connect("embedder", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
# Output: {'writer': {'documents_written': 54}}
```

--------------------------------

### Build and Run RAG Pipeline

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Construct a Retrieval-Augmented Generation (RAG) pipeline using InMemoryBM25Retriever, PromptBuilder, HuggingFaceAPIGenerator, and AnswerBuilder.

```python
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.utils import Secret

prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
    {
        "retriever": {"query": QUESTION},
        "prompt_builder": {"query": QUESTION},
        "answer_builder": {"query": QUESTION},
    }
)
```

--------------------------------

### Display Question, Answer, and Sources

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Print the question, the generated answer, and the sources used to generate the answer, extracting metadata from DocChunk.
```python
from docling.chunking import DocChunk

print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
    doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
    print(f"- text: {repr(doc_chunk.text)}")
    if doc_chunk.meta.origin:
        print(f"  file: {doc_chunk.meta.origin.filename}")
    if doc_chunk.meta.headings:
        print(f"  section: {' / '.join(doc_chunk.meta.headings)}")
    bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
    print(
        f"  page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
        f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]"
    )
```

--------------------------------

### Load Environment Variables

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Load environment variables, particularly the Hugging Face API key, using python-dotenv.

```python
from dotenv import load_dotenv

_ = load_dotenv()
```

--------------------------------

### Configure Custom DocumentConverter for PDF

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Set up a DocumentConverter with specific PDF format options, such as enabling OCR. This allows for customized document processing before ingestion.
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

from docling_haystack.converter import DoclingConverter

# Configure custom DocumentConverter with specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # Enable OCR for scanned documents

custom_converter = DocumentConverter(
    format_options={
        # Keys are InputFormat members, as in the Docling documentation
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Use custom converter in DoclingConverter
docling_converter = DoclingConverter(
    converter=custom_converter,
    convert_kwargs={},
)

result = docling_converter.run(paths=["scanned_document.pdf"])
print(f"Converted {len(result['documents'])} documents with OCR")
```

--------------------------------

### Run Pre-commit Checks On-Demand

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Manually runs all configured pre-commit checks on all files in the repository. This is useful for checking code style before committing.

```bash
pre-commit run --all-files
```

--------------------------------

### DoclingConverter Usage with Haystack

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

This snippet shows how to import and potentially use the DoclingConverter. Uncomment the import when the package is available on PyPI.

```python
# TODO: uncomment when package available on PyPI:
```

--------------------------------

### ExportType Configuration for DoclingConverter

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Illustrates how to configure the ExportType for DoclingConverter. DOC_CHUNKS mode creates multiple chunks per document, while MARKDOWN mode creates a single document per input file with Markdown content.
```python
from docling_haystack.converter import DoclingConverter, ExportType

# DOC_CHUNKS mode (default) - creates multiple chunks per document
chunked_converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
)
result = chunked_converter.run(paths=["document.pdf"])
# Returns multiple Document objects, one per chunk
print(f"Number of chunks: {len(result['documents'])}")  # e.g., 54 chunks

# MARKDOWN mode - creates one document per input file
markdown_converter = DoclingConverter(
    export_type=ExportType.MARKDOWN,
    md_export_kwargs={"image_placeholder": ""},  # Customize markdown export
)
result = markdown_converter.run(paths=["document.pdf"])
# Returns one Document object containing full markdown
print(f"Number of documents: {len(result['documents'])}")  # 1 document
print(f"Full markdown content: {result['documents'][0].content[:500]}...")
```

--------------------------------

### Implement Custom MetaExtractor

Source: https://context7.com/docling-project/docling-haystack/llms.txt

Create a custom metadata extractor by subclassing BaseMetaExtractor to define how metadata is captured from document chunks and the full document. This is useful for extracting application-specific fields.
```python
from typing import Any

from docling.chunking import BaseChunk
from docling.datamodel.document import DoclingDocument

from docling_haystack.converter import (
    DoclingConverter,
    BaseMetaExtractor,
    ExportType,
)


class CustomMetaExtractor(BaseMetaExtractor):
    """Extract custom metadata from documents and chunks."""

    def extract_chunk_meta(self, chunk: BaseChunk) -> dict[str, Any]:
        """Extract metadata from a document chunk."""
        meta = {"dl_meta": chunk.export_json_dict()}
        # Add custom fields
        if hasattr(chunk, 'meta') and chunk.meta.headings:
            meta["section"] = " / ".join(chunk.meta.headings)
        return meta

    def extract_dl_doc_meta(self, dl_doc: DoclingDocument) -> dict[str, Any]:
        """Extract metadata from the full document."""
        meta = {}
        if dl_doc.origin:
            meta["dl_meta"] = {"origin": dl_doc.origin.model_dump(exclude_none=True)}
            meta["filename"] = dl_doc.origin.filename
        return meta


# Use custom meta extractor
converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
    meta_extractor=CustomMetaExtractor(),
)

result = converter.run(paths=["document.pdf"])
for doc in result["documents"]:
    print(f"Section: {doc.meta.get('section', 'N/A')}")
```

--------------------------------

### Process and Display Document Chunks with Docling

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_milvus.ipynb

Use this code to print question, answer, and source details from a Docling-Haystack result. It iterates through sources, extracting and formatting metadata like text, filename, section, page number, and bounding box based on the export type.
```python
from docling.chunking import DocChunk

print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
    if EXPORT_TYPE == ExportType.DOC_CHUNKS:
        doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
        print(f"- text: {repr(doc_chunk.text)}")
        if doc_chunk.meta.origin:
            print(f"  file: {doc_chunk.meta.origin.filename}")
        if doc_chunk.meta.headings:
            print(f"  section: {' / '.join(doc_chunk.meta.headings)}")
        bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
        print(
            f"  page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
            f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]"
        )
    elif EXPORT_TYPE == ExportType.MARKDOWN:
        print(repr(source.content))
    else:
        raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
```

--------------------------------

### Python Callable Interface for Model Implementations

Source: https://github.com/docling-project/docling-haystack/blob/main/examples/example_basic.ipynb

Model implementations must satisfy the Python Callable interface. The `__call__` method should accept an iterator of page objects and return an iterator of augmented page objects.

```python
from typing import Iterator

from docling.datamodel.base_models import Page  # Docling's page model


class BaseModelPipeline:
    ...

    def __call__(self, pages: Iterator[Page]) -> Iterator[Page]:
        ...
```

--------------------------------

### Git Commit with DCO

Source: https://github.com/docling-project/docling-haystack/blob/main/CONTRIBUTING.md

Use this command to automatically include the DCO sign-off line in your commit message when committing changes to your local git repository.

```bash
git commit -s
```
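
To confirm that a commit message carries the sign-off line that `git commit -s` appends, a small check along these lines may help. This is a minimal sketch, not part of the project's tooling: the helper name `has_sign_off` is hypothetical, and the pattern only looks for a `Signed-off-by:` line with a non-empty value. It does not verify the name or email against any identity.

```python
import re

# A line that starts with "Signed-off-by: " followed by non-empty text.
# Assumption: this loose pattern is sufficient; it does not validate the identity.
SIGN_OFF = re.compile(r"^Signed-off-by: \S.*$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the commit message contains a DCO sign-off line."""
    return SIGN_OFF.search(commit_message) is not None


if __name__ == "__main__":
    message = "Fix typo in README\n\nSigned-off-by: John Doe\n"
    print(has_sign_off(message))
```

A check like this could run over `git log --format=%B` output before opening a pull request. Which commits to inspect is left to the caller.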