### Create Virtual Environment and Install Dependencies with uv

Source: https://github.com/docling-project/docling-parse/blob/main/CONTRIBUTING.md

Create the virtual environment and install the project dependencies with `uv sync`, then activate the environment. uv will create the environment if it doesn't exist.

```bash
uv sync
```

```bash
# On macOS and Linux.
source .venv/bin/activate
```

--------------------------------

### Install docling-parse

Source: https://github.com/docling-project/docling-parse/blob/main/README.md

Install the package using pip.

```sh
pip install docling-parse
```

--------------------------------

### Install Performance Extras

Source: https://github.com/docling-project/docling-parse/blob/main/perf/README.md

Installs optional third-party baselines for performance testing. Use either `uv sync` with the `perf-test` group or `pip install` with the `perf-tools` extra.

```sh
uv sync --group perf-test
```

```sh
pip install .[perf-tools]
```

--------------------------------

### Install Performance Tools (Alternative)

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

Installs the performance tools using pip, which is useful if uv is not available.

```sh
pip install .[perf-tools]
```

--------------------------------

### Install Performance Tools

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

Installs the extras needed for performance testing, including the non-docling baselines.

```sh
uv sync --group perf-test
```

--------------------------------

### Basic Scaling Script Execution Examples

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

These examples run the scaling script in different modes and with different thread configurations. They cover parsing, rendering, comparing against alternative backends, and enabling timing output.

```sh
python perf/run_scaling.py ./dataset --mode parse
```

```sh
python perf/run_scaling.py ./dataset --mode render --threads 1,2,4,8,12,16
```

```sh
python perf/run_scaling.py ./dataset --mode both --other "pypdfium2;pymupdf"
```

```sh
python perf/run_scaling.py ./dataset --mode render --enable-timing
```

--------------------------------

### Install uv Standalone

Source: https://github.com/docling-project/docling-parse/blob/main/CONTRIBUTING.md

Use this command to install the uv dependency manager on macOS and Linux systems. Refer to the uv documentation for other platforms.

```bash
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
```

--------------------------------

### Install Python Package with uv

Source: https://github.com/docling-project/docling-parse/blob/main/README.md

Synchronize Python dependencies using uv. This command assumes uv is installed and should be run after a clean git clone.

```sh
uv sync
```

--------------------------------

### Install pre-commit Hooks

Source: https://github.com/docling-project/docling-parse/blob/main/CONTRIBUTING.md

Install the pre-commit hooks to enforce code style checks automatically on every commit. This ensures consistency across the codebase.

```bash
pre-commit install
```

--------------------------------

### Parse-and-Render Example with DoclingThreadedPdfParser

Source: https://github.com/docling-project/docling-parse/blob/main/docs/plans/threaded-api-design.md

Illustrates how to use DoclingThreadedPdfParser for both parsing and rendering PDF documents. This example shows how to configure rendering options, load a document, and retrieve default, scaled, and cropped images of pages.

```python
from docling_core.types.doc.base import BoundingBox, CoordOrigin
from docling_parse.pdf_parser import (
    DecodeConfig,
    DoclingThreadedPdfParser,
    RenderConfig,
    ThreadedPdfParserConfig,
)

# Configure rendering before constructing the parser.
render_config = RenderConfig()
render_config.canvas_width = 1024

parser = DoclingThreadedPdfParser(
    parser_config=ThreadedPdfParserConfig(
        threads=4,
        render_config=render_config,
    ),
    decode_config=DecodeConfig(),
)

# `path` is not defined in this snippet; set it to the document to load.
doc_key = parser.load(path)

for result in parser.iterate_results():
    # Skip pages that failed to process.
    if not result.success:
        continue

    page = result.get_page()

    default_image = result.get_image()
    scaled_image = result.get_image(scale=2.0)
    cropped = result.get_image(
        scale=2.0,
        cropbox=BoundingBox(
            l=10,
            t=20,
            r=60,
            b=90,
            coord_origin=CoordOrigin.TOPLEFT,
        ),
    )
```

--------------------------------

### Run PyPDFium2 Benchmark

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

Benchmarks the pypdfium2 backend against a PDF file or directory.

```sh
python perf/run_perf.py ./dataset -r -p pypdfium2
```

--------------------------------

### CLI Usage for Scaling Script

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

This is the general command-line interface for running the performance scaling script. It lists the primary input and the options that control the script's behavior.

```text
python perf/run_scaling.py [input] [options]

options:
  --mode {parse,render,both}
  --recursive, -r
  --max-pages, -l N
  --max-concurrent-results N
  --threads 1,2,4,8,12,16
  --scale FLOAT
  --other "pypdfium2;pymupdf"
  --enable-timing / --no-enable-timing
  --timing-csv PATH
```

--------------------------------

### Set Default Paths

Source: https://github.com/docling-project/docling-parse/blob/main/CMakeLists.txt

Sets default values for project paths that are not already defined: the top-level, installation, externals, resources, and PDF data directories. It also appends the project's `cmake` directory to `CMAKE_MODULE_PATH` if it is not already listed.

```cmake
if(NOT DEFINED TOPLEVEL_PREFIX_PATH)
  set(TOPLEVEL_PREFIX_PATH ${CMAKE_CURRENT_SOURCE_DIR})
endif()

if(NOT DEFINED CMAKE_INSTALL_PREFIX)
  set(CMAKE_INSTALL_PREFIX ${TOPLEVEL_PREFIX_PATH}/install_dir)
endif()

if(NOT DEFINED EXTERNALS_PREFIX_PATH)
  set(EXTERNALS_PREFIX_PATH "${TOPLEVEL_PREFIX_PATH}/externals" CACHE INTERNAL "")
endif()

if(NOT DEFINED RESOURCES_PREFIX_PATH)
  set(RESOURCES_PREFIX_PATH "${TOPLEVEL_PREFIX_PATH}/resources" CACHE INTERNAL "")
endif()

if(NOT "${TOPLEVEL_PREFIX_PATH}/cmake" IN_LIST CMAKE_MODULE_PATH)
  set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${TOPLEVEL_PREFIX_PATH}/cmake")
endif()

if(NOT DEFINED CMAKE_PDF_DATA_DIR)
  set(CMAKE_PDF_DATA_DIR "${TOPLEVEL_PREFIX_PATH}/docling_parse/pdf_resources")
endif()
```

--------------------------------

### Develop Python Package with uv (Force Reinstall)

Source: https://github.com/docling-project/docling-parse/blob/main/README.md

Install the Python package in editable mode with a forced reinstallation, excluding dependencies. This is useful when developing C++ code.

```sh
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"
```

--------------------------------

### Get Image Function Signature

Source: https://github.com/docling-project/docling-parse/blob/main/docs/plans/threaded-api-design.md

Signature for the `get_image` function, which retrieves rendered images from a parser configured for rendering. It supports optional scaling, canvas size, and cropping parameters.

```python
get_image(
    scale: float | None = None,
    canvas_size: tuple[int, int] | None = None,
    cropbox: BoundingBox | None = None,
) -> PIL.Image.Image
```

--------------------------------

### Display Build Information

Source: https://github.com/docling-project/docling-parse/blob/main/CMakeLists.txt

Logs various build-related paths and system information using the message command. This helps in debugging and understanding the build environment.

```cmake
message(STATUS "cmake osx-deployment: " ${CMAKE_OSX_DEPLOYMENT_TARGET})
message(STATUS "cmake system-version: " ${CMAKE_SYSTEM_VERSION})
message(STATUS "cmake osx-deployment: " ${CMAKE_OSX_DEPLOYMENT_TARGET})
message(STATUS " top path: " ${TOPLEVEL_PREFIX_PATH})
message(STATUS " lib path: " ${EXTERNALS_PREFIX_PATH})
message(STATUS " install path: " ${CMAKE_INSTALL_PREFIX})
message(STATUS " cmake path: " ${CMAKE_MODULE_PATH})
message(STATUS " cmake system: " ${CMAKE_SYSTEM_PROCESSOR})
message(STATUS "cmake osx arch: " ${CMAKE_OSX_ARCHITECTURES})
```

--------------------------------

### Develop Python Package with uv (Threaded)

Source: https://github.com/docling-project/docling-parse/blob/main/README.md

Install the Python package in editable mode using uv, specifying the number of build threads. This is an alternative for development when C++ code is updated.

```sh
BUILD_THREADS=12 uv pip install --force-reinstall --no-deps -e ".[perf]"
```

--------------------------------

### Run PyMuPDF Benchmark

Source: https://github.com/docling-project/docling-parse/blob/main/docs/performance_code.md

Benchmarks the pymupdf backend against a PDF file or directory.

```sh
python perf/run_perf.py ./dataset -r -p pymupdf
```
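
The pypdfium2 and pymupdf benchmark commands differ only in the value passed to `-p`, so both runs can be driven from one small script. The following is a minimal sketch, not part of the repository: `build_perf_command` and `run_all` are hypothetical helper names, and the sketch assumes it is run from the repository root so that `perf/run_perf.py` resolves. It uses only the flags shown in the commands above.

```python
import subprocess
import sys

# Backend names taken from the benchmark commands in this document.
BACKENDS = ("pypdfium2", "pymupdf")


def build_perf_command(dataset: str, backend: str) -> list[str]:
    """Build the argument list for one benchmark run (hypothetical helper)."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Mirrors: python perf/run_perf.py <dataset> -r -p <backend>
    return [sys.executable, "perf/run_perf.py", dataset, "-r", "-p", backend]


def run_all(dataset: str = "./dataset") -> None:
    """Run each backend in turn; raises CalledProcessError on the first failure."""
    for backend in BACKENDS:
        subprocess.run(build_perf_command(dataset, backend), check=True)
```

Calling `run_all("./dataset")` runs the two benchmarks one after the other. Using `sys.executable` keeps the runs inside the currently active virtual environment.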