### Set Up Conda Environment and Install OCRFlux

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

These commands create a dedicated Conda environment that isolates OCRFlux's Python dependencies from existing environments, activate it, clone the OCRFlux GitHub repository, and install the package in editable mode with pip. The `--find-links` option points pip at the `flashinfer` wheel index (the URL targets CUDA 12.4 and PyTorch 2.5 builds) so that this dependency resolves correctly.

```bash
conda create -n ocrflux python=3.11
conda activate ocrflux

git clone https://github.com/chatdoc-com/OCRFlux.git
cd OCRFlux

pip install -e . --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```

--------------------------------

### Install System Dependencies for OCRFlux on Debian/Ubuntu

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

These commands install the system-level utilities and fonts that OCRFlux needs on Debian or Ubuntu: `poppler-utils` for PDF processing and several font packages for document rendering. Running `sudo apt-get update` first refreshes the package lists before installation.

```bash
sudo apt-get update
sudo apt-get install poppler-utils poppler-data ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```

--------------------------------

### Run OCRFlux Pipeline for PDF Document

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Executes the OCRFlux pipeline to process a single PDF document. Requires a local workspace and a pre-trained OCRFlux model. Results are stored in JSONL format.

```bash
python -m ocrflux.pipeline ./localworkspace --data test.pdf --model /model_dir/OCRFlux-3B
```

--------------------------------

### Run OCRFlux Pipeline for Image File

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Executes the OCRFlux pipeline to process a single image file (e.g., PNG). Requires a local workspace and a pre-trained OCRFlux model. Results are stored in JSONL format.

```bash
python -m ocrflux.pipeline ./localworkspace --data test_page.png --model /model_dir/OCRFlux-3B
```

--------------------------------

### OCRFlux Pipeline Command-Line Interface (CLI) Reference

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Documents the command-line arguments and options available for the `ocrflux.pipeline` script. It details parameters for task selection, data input, page processing, error handling, model configuration, and output settings.

```APIDOC
usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT]
                   [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder

options:
  -h, --help            show this help message and exit
  --data [DATA ...]     List of paths to files to process
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --model MODEL         The path to the model
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to vllm server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --skip_cross_page_merge
                        Whether to skip cross-page merging
  --port PORT           Port to use for the VLLM server
```

--------------------------------

### Run OCRFlux Pipeline for Directory of Documents

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Executes the OCRFlux pipeline to process multiple PDF or image files within a specified directory. Uses a wildcard to process all supported files. Requires a local workspace and a pre-trained OCRFlux model.

```bash
python -m ocrflux.pipeline ./localworkspace --data test_pdf_dir/* --model /model_dir/OCRFlux-3B
```
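The shell wildcard passes every entry in the directory to `--data`, including files the pipeline may not handle. A small pre-filter can build the list explicitly. This is a sketch under assumptions: `collect_inputs` is a hypothetical helper, and the extension list is illustrative only, since the source shows just `.pdf` and `.png` inputs.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: illustrative list only; the source demonstrates .pdf and .png inputs.
ASSUMED_EXTENSIONS = (".pdf", ".png")


def collect_inputs(directory: str, extensions: Iterable[str] = ASSUMED_EXTENSIONS) -> List[str]:
    """Return sorted paths of regular files in `directory` with an allowed extension."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        # Skip subdirectories and compare extensions case-insensitively.
        if path.is_file() and path.suffix.lower() in allowed
    )
```

The resulting list can be supplied as the values of `--data` in place of `test_pdf_dir/*`.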

--------------------------------

### Run OCRFlux in Docker Container

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Provides a Docker command to run OCRFlux within a container, leveraging GPU support. It mounts local directories for workspace, input data, and the OCRFlux model, allowing for containerized document processing.

```bash
docker run -it --gpus all \
  -v /path/to/localworkspace:/localworkspace \
  -v /path/to/test_pdf_dir:/test_pdf_dir/ \
  -v /path/to/OCRFlux-3B:/OCRFlux-3B \
  chatdoc/ocrflux:latest /localworkspace --data /test_pdf_dir/* --model /OCRFlux-3B/
```

--------------------------------

### Convert JSONL Results to Markdown

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Executes a Python script to convert the JSONL output files generated by the OCRFlux pipeline into final Markdown documents. The generated Markdown files are stored in a structured directory within the local workspace.

```bash
python -m ocrflux.jsonl_to_markdown ./localworkspace
```
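Conceptually, the conversion reads one JSON object per line and writes its `document_text` field to a Markdown file. The sketch below illustrates that idea only; it is not the implementation of `ocrflux.jsonl_to_markdown`. The function name and the output naming scheme (input file stem plus `.md`) are assumptions, while the `orig_path` and `document_text` fields come from the JSONL schema documented in this file.

```python
import json
from pathlib import Path
from typing import List


def jsonl_to_markdown_files(jsonl_path: str, out_dir: str) -> List[Path]:
    """Write one Markdown file per JSONL record (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Assumed naming scheme: reuse the stem of the original input file.
            target = out / (Path(record["orig_path"]).stem + ".md")
            target.write_text(record["document_text"], encoding="utf-8")
            written.append(target)
    return written
```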

--------------------------------

### Directly Call OCRFlux Inference API (Python)

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Demonstrates how to integrate OCRFlux directly into Python code for document parsing without an online vLLM server. It initializes an LLM model and uses the `ocrflux.inference.parse` function to process a document, saving the result as Markdown.

```python
from vllm import LLM
from ocrflux.inference import parse

file_path = 'test.pdf'
# file_path = 'test.png'
# Load the model once; the same LLM instance can be reused for every parse call.
llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)
result = parse(llm, file_path)
if result is not None:
    document_markdown = result['document_text']
    print(document_markdown)
    with open('test.md','w') as f:
        f.write(document_markdown)
else:
    print("Parse failed.")
```
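The same pattern extends to several files while reusing one loaded model. The following sketch relies only on what the example above shows: `parse(llm, file_path)` returns `None` on failure and otherwise a dictionary with a `document_text` key. `parse_many` is a hypothetical helper, and the parse function is passed in as an argument so the loop can be exercised without a GPU.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def parse_many(
    parse_fn: Callable[[Any, str], Optional[Dict[str, Any]]],
    llm: Any,
    file_paths: Iterable[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Parse each file and separate successes from failures (hypothetical helper)."""
    converted: Dict[str, str] = {}
    failed: List[str] = []
    for path in file_paths:
        result = parse_fn(llm, path)
        if result is None:
            # A None result signals a failed parse, as in the example above.
            failed.append(path)
        else:
            converted[path] = result["document_text"]
    return converted, failed
```

With the real library this would be called as `parse_many(parse, llm, ["test.pdf", "test.png"])`, using the `parse` and `llm` objects from the example above.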

--------------------------------

### OCRFlux JSONL Output Schema

Source: https://github.com/chatdoc-com/ocrflux/blob/main/README.md

Defines the structure of the JSON objects stored in the output JSONL files generated by the OCRFlux pipeline. Each object represents a processed document with its original path, page count, full Markdown text, and per-page Markdown texts.

```APIDOC
{
    "orig_path": str,  # the path to the raw pdf or image file
    "num_pages": int,  # the number of pages in the pdf file
    "document_text": str, # the Markdown text of the converted pdf or image file
    "page_texts": dict, # the Markdown texts of each page in the pdf file, the key is the page index and the value is the Markdown text of the page
    "fallback_pages": [int], # the page indexes that are not converted successfully
}
```
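A record loaded from a JSONL line can be walked page by page. The sketch below assumes the schema above and relies on one general JSON fact: object keys are always strings, so the page indexes in `page_texts` arrive as strings after `json.loads` and need converting before a numeric sort. `OCRFluxRecord` and `pages_in_order` are hypothetical names, and the source does not say whether page indexes start at 0 or 1.

```python
from typing import Dict, List, Tuple, TypedDict


class OCRFluxRecord(TypedDict):
    """Typed view of one JSONL record, mirroring the schema above."""
    orig_path: str
    num_pages: int
    document_text: str
    page_texts: Dict[str, str]  # JSON object keys are strings
    fallback_pages: List[int]


def pages_in_order(record: OCRFluxRecord) -> List[Tuple[int, str, bool]]:
    """Return (page_index, markdown, is_fallback) tuples sorted by page index."""
    fallback = set(record["fallback_pages"])
    ordered = sorted(record["page_texts"].items(), key=lambda item: int(item[0]))
    return [(int(key), text, int(key) in fallback) for key, text in ordered]
```

Sorting on `int(key)` matters once a document has ten or more pages, because a plain string sort would place `"10"` before `"2"`.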
