### Set Up Conda Environment and Install OCRFlux

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Creates a dedicated Conda environment so that OCRFlux's Python dependencies do not conflict with existing environments, activates it, clones the OCRFlux GitHub repository, and installs the package in editable mode with pip. The `--find-links` option points pip at the `flashinfer` wheel index (the URL targets CUDA 12.4 and torch 2.5) so that this dependency resolves correctly.

```bash
conda create -n ocrflux python=3.11
conda activate ocrflux

git clone https://github.com/chatdoc-com/OCRFlux.git
# git names the checkout after the repository, so the directory is OCRFlux
cd OCRFlux

pip install -e . --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```

--------------------------------

### Install System Dependencies for OCRFlux on Debian/Ubuntu

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Installs the system-level utilities and fonts that OCRFlux needs on Debian or Ubuntu: `poppler-utils` for PDF processing, plus several font packages used for document rendering. Running `sudo apt-get update` first refreshes the package lists before installation.

```bash
sudo apt-get update
sudo apt-get install poppler-utils poppler-data ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```

--------------------------------

### Run OCRFlux Pipeline for PDF Document

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Runs the OCRFlux pipeline on a single PDF document. Requires a local workspace directory and a pre-trained OCRFlux model. Results are stored in JSONL format.

```bash
python -m ocrflux.pipeline ./localworkspace --data test.pdf --model /model_dir/OCRFlux-3B
```

--------------------------------

### Run OCRFlux Pipeline for Image File

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Runs the OCRFlux pipeline on a single image file (e.g., PNG). Requires a local workspace directory and a pre-trained OCRFlux model. Results are stored in JSONL format.

```bash
python -m ocrflux.pipeline ./localworkspace --data test_page.png --model /model_dir/OCRFlux-3B
```

--------------------------------

### OCRFlux Pipeline Command-Line Interface (CLI) Reference

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Documents the command-line arguments and options of the `ocrflux.pipeline` script: task selection, data input, page grouping, retry and error-rate limits, model configuration, and server settings.

```APIDOC
usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}]
                   [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP]
                   [--max_page_retries MAX_PAGE_RETRIES]
                   [--max_page_error_rate MAX_PAGE_ERROR_RATE]
                   [--workers WORKERS] [--model MODEL]
                   [--model_max_context MODEL_MAX_CONTEXT]
                   [--model_chat_template MODEL_CHAT_TEMPLATE]
                   [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
                   [--skip_cross_page_merge] [--port PORT]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder

options:
  -h, --help            show this help message and exit
  --data [DATA ...]     List of paths to files to process
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --model MODEL         The path to the model
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to vllm server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --skip_cross_page_merge
                        Whether to skip cross-page merging
  --port PORT           Port to use for the VLLM server
```

--------------------------------

### Run OCRFlux Pipeline for Directory of Documents

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Runs the OCRFlux pipeline on multiple PDF or image files in a directory. The shell wildcard passes every file in the directory to `--data`. Requires a local workspace directory and a pre-trained OCRFlux model.

```bash
python -m ocrflux.pipeline ./localworkspace --data test_pdf_dir/* --model /model_dir/OCRFlux-3B
```

--------------------------------

### Run OCRFlux in Docker Container

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Runs OCRFlux inside a Docker container with GPU support. The command mounts local directories for the workspace, the input data, and the OCRFlux model, so documents can be processed without a local installation.

```bash
docker run -it --gpus all \
  -v /path/to/localworkspace:/localworkspace \
  -v /path/to/test_pdf_dir:/test_pdf_dir/ \
  -v /path/to/OCRFlux-3B:/OCRFlux-3B \
  chatdoc/ocrflux:latest /localworkspace --data /test_pdf_dir/* --model /OCRFlux-3B/
```

--------------------------------

### Convert JSONL Results to Markdown

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Converts the JSONL output files produced by the OCRFlux pipeline into final Markdown documents. The generated Markdown files are stored in a structured directory within the local workspace.

```bash
python -m ocrflux.jsonl_to_markdown ./localworkspace
```

--------------------------------

### Directly Call OCRFlux Inference API (Python)

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Shows how to call OCRFlux directly from Python code, without an online vLLM server. The example creates a vLLM `LLM` instance, passes it to `ocrflux.inference.parse` together with the document path, and saves the resulting Markdown to a file.

```python
from vllm import LLM
from ocrflux.inference import parse

file_path = 'test.pdf'
# file_path = 'test.png'

# Load the model in-process instead of talking to a separate vLLM server
llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)

result = parse(llm, file_path)
if result is not None:
    # 'document_text' holds the Markdown for the whole document
    document_markdown = result['document_text']
    print(document_markdown)
    with open('test.md', 'w') as f:
        f.write(document_markdown)
else:
    print("Parse failed.")
```

--------------------------------

### OCRFlux JSONL Output Schema

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Defines the structure of the JSON objects stored in the JSONL files that the OCRFlux pipeline writes. Each object represents one processed document: its original path, page count, full Markdown text, per-page Markdown texts, and the pages that could not be converted.

```APIDOC
{
    "orig_path": str,         # the path to the raw pdf or image file
    "num_pages": int,         # the number of pages in the pdf file
    "document_text": str,     # the Markdown text of the converted pdf or image file
    "page_texts": dict,       # the Markdown texts of each page in the pdf file; the key is the page index and the value is the Markdown text of the page
    "fallback_pages": [int],  # the page indexes that are not converted successfully
}
```
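
As a rough illustration of how records following the JSONL output schema could be consumed, the sketch below reads a JSONL file with the Python standard library and reports which pages fell back. It is not part of OCRFlux: `load_results` and `summarize` are hypothetical helper names, the sample record is made up, and the page-index keys `"0"` and `"1"` are an assumption about numbering. The location of the JSONL files inside the workspace is not stated here, so the path is passed in explicitly.

```python
import json
import tempfile
from pathlib import Path


def load_results(jsonl_path):
    """Yield one dict per non-empty line of a JSONL file (hypothetical helper)."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(record):
    """Return (original path, page count, sorted fallback page indexes)."""
    return record["orig_path"], record["num_pages"], sorted(record["fallback_pages"])


# Made-up record shaped like the documented schema; not real OCRFlux output.
# JSON object keys are always strings, so page indexes are str after loading.
sample = {
    "orig_path": "test.pdf",
    "num_pages": 2,
    "document_text": "# Title\n\nBody text",
    "page_texts": {"0": "# Title", "1": "Body text"},
    "fallback_pages": [1],
}

# Write the sample to a temporary JSONL file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    records = list(load_results(path))

for rec in records:
    orig_path, num_pages, failed = summarize(rec)
    print(f"{orig_path}: {num_pages} pages, fallback pages: {failed}")
```

To use this on real output, replace the temporary file with the path to a JSONL file produced by the pipeline; checking `fallback_pages` first is a cheap way to find documents that may need a second pass.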