# Surya

Surya is a document OCR toolkit for accurate text extraction and document analysis across 90+ languages. Built on PyTorch and modern transformer architectures, it provides OCR, text line detection, document layout analysis, reading order detection, table structure recognition, and LaTeX equation OCR. The toolkit is optimized for document images and PDFs, and supports both CPU and GPU inference with configurable batch processing.

The library uses a modular predictor-based architecture: each task (detection, recognition, layout, table recognition) has its own predictor class that handles model loading, preprocessing, and inference. Surya's foundation model enables high-quality OCR with character-level bounding boxes, math formula recognition, and multi-token prediction for improved performance. Supported input formats include images, PDFs, and folders of documents.

## Text Detection with DetectionPredictor

DetectionPredictor performs line-level text detection in document images, identifying bounding boxes for text regions regardless of language. It uses a semantic segmentation model to generate heatmaps that are post-processed into precise polygon bounding boxes with confidence scores.
```python
from PIL import Image

from surya.detection import DetectionPredictor

# Initialize the detection predictor (downloads model automatically)
det_predictor = DetectionPredictor()

# Load and process a single image
image = Image.open("document.png")
predictions = det_predictor([image])

# Access detection results for the first image
result = predictions[0]
print(f"Found {len(result.bboxes)} text regions")
for bbox in result.bboxes:
    # bbox.polygon contains 4 corner points: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    print(f"Polygon: {bbox.polygon}")
    print(f"Bounding box: {bbox.bbox}")  # [x1, y1, x2, y2] format
    print(f"Confidence: {bbox.confidence:.2f}")

# Process multiple images with a custom batch size
images = [Image.open(f"page_{i}.png") for i in range(10)]
batch_predictions = det_predictor(images, batch_size=8)

# Include heatmaps in the output for debugging
predictions_with_maps = det_predictor([image], include_maps=True)
heatmap = predictions_with_maps[0].heatmap
affinity_map = predictions_with_maps[0].affinity_map
```

## OCR Text Recognition with RecognitionPredictor

RecognitionPredictor performs end-to-end OCR by combining text detection with text recognition. It requires a FoundationPredictor for the recognition model and optionally a DetectionPredictor for automatic line detection. The predictor returns detailed results including text, confidence scores, character-level bounding boxes, and word segmentation.
```python
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.recognition import RecognitionPredictor
from surya.detection import DetectionPredictor

# Initialize predictors
foundation_predictor = FoundationPredictor()
rec_predictor = RecognitionPredictor(foundation_predictor)
det_predictor = DetectionPredictor()

# Full OCR pipeline: detect lines, then recognize text
image = Image.open("document.png")
predictions = rec_predictor(
    [image],
    det_predictor=det_predictor,  # Automatic line detection
    math_mode=True,               # Enable math formula recognition
    return_words=True,            # Include word-level results
    sort_lines=True,              # Sort lines by reading order
)

# Access OCR results
result = predictions[0]
for line in result.text_lines:
    print(f"Text: {line.text}")
    print(f"Confidence: {line.confidence:.2f}")
    print(f"Polygon: {line.polygon}")

    # Character-level details
    for char in line.chars:
        print(f"  Char: '{char.text}' at {char.bbox} (valid: {char.bbox_valid})")

    # Word-level details (when return_words=True)
    for word in line.words:
        print(f"  Word: '{word.text}' at {word.bbox}")

# OCR with pre-defined bounding boxes (skips detection)
bboxes = [[[100, 50, 400, 80], [100, 90, 400, 120]]]  # One list of bboxes per image
predictions = rec_predictor(
    [image],
    bboxes=bboxes,  # Provide your own bounding boxes
)

# OCR with custom polygons
polygons = [[
    [[100, 50], [400, 50], [400, 80], [100, 80]],    # Line 1
    [[100, 90], [400, 90], [400, 120], [100, 120]],  # Line 2
]]
predictions = rec_predictor([image], polygons=polygons)

# Use high-resolution images for better accuracy
highres_image = Image.open("document_highres.png")
lowres_image = highres_image.copy()
lowres_image.thumbnail((1024, 1024))
predictions = rec_predictor(
    [lowres_image],
    det_predictor=det_predictor,
    highres_images=[highres_image],  # Recognition uses the high-res copies
)
```

## Layout Analysis with LayoutPredictor

LayoutPredictor analyzes document structure by detecting and classifying regions such as text blocks, tables,
figures, headers, footers, and more. It also determines the reading order of detected elements, making it essential for document understanding tasks.

```python
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.layout import LayoutPredictor
from surya.settings import settings

# Initialize the layout predictor with its foundation model checkpoint
foundation_predictor = FoundationPredictor(
    checkpoint=settings.LAYOUT_MODEL_CHECKPOINT
)
layout_predictor = LayoutPredictor(foundation_predictor)

# Analyze document layout
image = Image.open("document.png")
predictions = layout_predictor([image])

# Access layout results
result = predictions[0]
for box in result.bboxes:
    print(f"Label: {box.label}")        # e.g., Text, Table, Picture, SectionHeader
    print(f"Position: {box.position}")  # Reading-order position
    print(f"Polygon: {box.polygon}")
    print(f"Confidence: {box.confidence:.2f}")

    # Top-k alternative labels with confidence scores
    if box.top_k:
        print(f"Alternative labels: {box.top_k}")

# Layout labels include:
# - Text, SectionHeader, Caption, Footnote
# - Table, TableOfContents, Form
# - Picture, Figure, Equation, Code
# - PageHeader, PageFooter, ListItem

# Process multiple pages
pages = [Image.open(f"page_{i}.png") for i in range(5)]
all_layouts = layout_predictor(pages, batch_size=4)

# Get top-k label predictions for each element
predictions = layout_predictor([image], top_k=5)
for box in predictions[0].bboxes:
    print(f"{box.label}: {box.top_k}")
```

## Table Recognition with TableRecPredictor

TableRecPredictor extracts detailed table structure including rows, columns, and cells with their positions and spanning information. It detects header rows/columns and cell merging (colspan/rowspan), and provides precise bounding boxes for each table component.
```python
from PIL import Image

from surya.table_rec import TableRecPredictor

# Initialize the table recognition predictor
table_predictor = TableRecPredictor()

# Recognize table structure (the image should be cropped to the table)
table_image = Image.open("table.png")
predictions = table_predictor([table_image])

# Access table structure
result = predictions[0]

# Row information
print(f"Found {len(result.rows)} rows")
for row in result.rows:
    print(f"Row {row.row_id}: bbox={row.bbox}, header={row.is_header}")

# Column information
print(f"Found {len(result.cols)} columns")
for col in result.cols:
    print(f"Column {col.col_id}: bbox={col.bbox}, header={col.is_header}")

# Cell information (after merging)
print(f"Found {len(result.cells)} cells")
for cell in result.cells:
    print(f"Cell ({cell.row_id}, {cell.col_id})")
    print(f"  Bbox: {cell.bbox}")
    print(f"  Colspan: {cell.colspan}, Rowspan: {cell.rowspan}")
    print(f"  Is header: {cell.is_header}")
    print(f"  Merge up: {cell.merge_up}, Merge down: {cell.merge_down}")

# Unmerged cells (before row/column merging)
for cell in result.unmerged_cells:
    print(f"Unmerged cell at ({cell.row_id}, {cell.col_id})")

# Process multiple table images
table_images = [Image.open(f"table_{i}.png") for i in range(3)]
all_tables = table_predictor(table_images, batch_size=4)
```

## Foundation Model Direct Usage

FoundationPredictor is the core model behind text recognition and layout analysis. It can be used directly for fine-grained control over OCR tasks, including different task modes and custom token limits.
```python
import numpy as np
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.common.surya.schema import TaskNames

# Initialize the foundation predictor
predictor = FoundationPredictor()

# Prepare the image as a numpy array
image = Image.open("document.png").convert("RGB")
image_np = np.array(image)

# Run the prediction loop directly with a chosen task type:
# TaskNames.ocr_with_boxes      - OCR with character bounding boxes
# TaskNames.ocr_without_boxes   - OCR without boxes (potentially better text)
# TaskNames.block_without_boxes - Block-level text (paragraphs, equations)
# TaskNames.layout              - Layout analysis
predicted_tokens, batch_bboxes, scores, topk_probs = predictor.prediction_loop(
    images=[image_np],
    input_texts=[""],                      # Optional input text hints
    task_names=[TaskNames.ocr_with_boxes],
    batch_size=32,
    max_tokens=224,                        # Maximum tokens to generate
    math_mode=True,                        # Enable math recognition
    drop_repeated_tokens=True,             # Filter repeated outputs
    max_sliding_window=576,                # Sliding window for long sequences
    tqdm_desc="Processing OCR",
)

# Decode tokens using the processor
for idx, tokens in enumerate(predicted_tokens):
    text = predictor.processor.ocr_tokenizer.decode(
        tokens, task=TaskNames.ocr_with_boxes
    )
    print(f"Image {idx}: {text}")

# Access bounding box predictions
print(f"Bboxes shape: {batch_bboxes.shape}")  # [batch, max_tokens, 6]
```

## Loading Documents from Files and Folders

Surya provides utilities for loading images and PDFs with automatic format detection and page-range support.
```python
from surya.input.load import (
    load_from_file,
    load_from_folder,
    load_pdf,
    load_image,
)
from surya.settings import settings

# Load a single PDF with a specific page range
images, names = load_pdf(
    "document.pdf",
    page_range=[0, 1, 2],    # Pages 0, 1, 2
    dpi=settings.IMAGE_DPI,  # Default 96 DPI
)

# Load high-resolution images for OCR
highres_images, _ = load_pdf(
    "document.pdf",
    dpi=settings.IMAGE_DPI_HIGHRES,  # Default 192 DPI
)

# Load any file (auto-detects PDF vs. image)
images, names = load_from_file("document.pdf", page_range=[0, 5, 10])
images, names = load_from_file("scan.png")

# Load all documents from a folder
images, names = load_from_folder(
    "documents/",
    page_range=None,        # All pages for PDFs
    dpi=settings.IMAGE_DPI,
)

# Load a single image
images, names = load_image("page.jpg")
print(f"Loaded {len(images)} images: {names}")
```

## OCR Error Detection with OCRErrorPredictor

OCRErrorPredictor analyzes OCR output text to detect likely recognition errors, helping identify low-quality results that may need manual review or reprocessing.

```python
from surya.ocr_error import OCRErrorPredictor

# Initialize the error detection predictor
error_predictor = OCRErrorPredictor()

# Analyze OCR text for errors
texts = [
    "This is clean, well-formatted text.",
    "Th1s t3xt h@s m@ny 0CR err0rs",
    "Normal document text with proper formatting.",
    "Garbled t€xt w!th $ymbol$ replacing l3tters",
]
result = error_predictor(texts, batch_size=4)

# Access error detection results
for text, label in zip(result.texts, result.labels):
    print(f"Text: {text[:50]}...")
    print(f"Label: {label}")  # e.g., "clean" or "error"
    print()
```

## Drawing Detection Results on Images

Surya includes debug utilities for visualizing detection results, bounding boxes, and text predictions on images.
```python
from PIL import Image

from surya.detection import DetectionPredictor
from surya.debug.draw import draw_polys_on_image, draw_bboxes_on_image
from surya.debug.text import draw_text_on_image

# Run detection
det_predictor = DetectionPredictor()
image = Image.open("document.png")
predictions = det_predictor([image])

# Draw polygons on the image
result_image = image.copy()
polygons = [bbox.polygon for bbox in predictions[0].bboxes]
labels = [f"{bbox.confidence:.2f}" for bbox in predictions[0].bboxes]
annotated = draw_polys_on_image(
    corners=polygons,
    image=result_image,
    labels=labels,
    label_font_size=12,
    color="red",
)
annotated.save("detection_result.png")

# Draw axis-aligned bounding boxes
bboxes = [bbox.bbox for bbox in predictions[0].bboxes]  # [x1, y1, x2, y2]
annotated = draw_bboxes_on_image(
    bboxes=bboxes,
    image=image.copy(),
    labels=labels,
    color="blue",
)
annotated.save("bbox_result.png")

# Draw OCR text results
from surya.foundation import FoundationPredictor
from surya.recognition import RecognitionPredictor

foundation_predictor = FoundationPredictor()
rec_predictor = RecognitionPredictor(foundation_predictor)
ocr_results = rec_predictor([image], det_predictor=det_predictor)

text_bboxes = [line.bbox for line in ocr_results[0].text_lines]
text_lines = [line.text for line in ocr_results[0].text_lines]
text_image = draw_text_on_image(text_bboxes, text_lines, image.size)
text_image.save("ocr_result.png")
```

## Command Line Interface

Surya provides several CLI commands for processing documents without writing code. Each command supports common options for input paths, output directories, and batch processing.
```bash
# Text detection - outputs bounding boxes to JSON
surya_detect document.pdf --images --output_dir ./results

# OCR - full text recognition
surya_ocr document.pdf --images --output_dir ./results
surya_ocr scan.png --task_name ocr_without_boxes
surya_ocr documents/ --page_range 0,5-10,20

# Layout analysis - document structure detection
surya_layout document.pdf --images --output_dir ./results

# Table recognition - extract table structure
surya_table table.png --images --output_dir ./results
surya_table document.pdf --detect_boxes --skip_table_detection

# LaTeX OCR - recognize equations (requires cropped equation images)
surya_latex_ocr equation.png --output_dir ./results

# Interactive GUI applications
pip install streamlit pdftext
surya_gui  # General OCR GUI

pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
texify_gui  # Equation selection and OCR GUI
```

## Configuration and Environment Variables

Surya's behavior can be customized through environment variables for batch sizes, model checkpoints, and device settings.
```python
import os

# Set environment variables before importing surya so the settings
# object picks them up at import time.

# Set the device (auto-detected by default)
os.environ["TORCH_DEVICE"] = "cuda"  # or "cpu", "mps"

# Batch sizes for the different operations
os.environ["DETECTOR_BATCH_SIZE"] = "36"      # Text detection
os.environ["RECOGNITION_BATCH_SIZE"] = "256"  # OCR recognition
os.environ["LAYOUT_BATCH_SIZE"] = "32"        # Layout analysis
os.environ["TABLE_REC_BATCH_SIZE"] = "64"     # Table recognition

# Detection thresholds
os.environ["DETECTOR_TEXT_THRESHOLD"] = "0.6"    # Text confidence
os.environ["DETECTOR_BLANK_THRESHOLD"] = "0.35"  # Blank space threshold

# Enable model compilation for faster inference
os.environ["COMPILE_DETECTOR"] = "true"
os.environ["COMPILE_LAYOUT"] = "true"
os.environ["COMPILE_TABLE_REC"] = "true"
os.environ["COMPILE_ALL"] = "true"  # Compile all models

# Custom model checkpoints
os.environ["FOUNDATION_MODEL_CHECKPOINT"] = "s3://text_recognition/2025_09_23"
os.environ["DETECTOR_MODEL_CHECKPOINT"] = "s3://text_detection/2025_05_07"

from surya.settings import settings

# Access the current settings
print(f"Device: {settings.TORCH_DEVICE_MODEL}")
print(f"Detection batch size: {settings.DETECTOR_BATCH_SIZE}")
print(f"Model cache directory: {settings.MODEL_CACHE_DIR}")

# Disable progress bars
settings.DISABLE_TQDM = True
```

## Summary

Surya excels at document-centric OCR tasks, including multi-language text recognition, document structure analysis, and table extraction. Its primary use cases include digitizing scanned documents, extracting text from PDFs with complex layouts, processing forms and invoices with table data, and converting printed mathematical equations to LaTeX. The modular predictor architecture allows combining detection, recognition, and layout analysis in flexible pipelines tailored to specific document types. A typical integration loads documents via the input utilities, runs detection to find text regions, performs recognition to extract text, and optionally analyzes layout for document understanding.
For production deployments, choose batch sizes appropriate to your hardware (GPU memory) and enable model compilation for optimal throughput. The toolkit integrates cleanly with downstream document processing pipelines through its JSON output format and Pydantic-based result schemas, which can be easily serialized and processed.
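As a sketch of downstream consumption, the snippet below parses serialized OCR results with only the standard library. The nested structure and field names are illustrative assumptions (modeled on the line-level fields shown earlier: `text`, `confidence`, `bbox`) and may differ between Surya versions, so check the actual JSON your version emits before relying on them.

```python
import json

# Illustrative sample mirroring the assumed shape of serialized OCR results:
# a mapping from document name to a list of per-page results, each carrying
# "text_lines" entries. Field names here are assumptions, not a guaranteed schema.
sample_json = json.dumps({
    "document": [
        {
            "page": 0,
            "text_lines": [
                {"text": "Hello", "confidence": 0.98, "bbox": [10, 10, 120, 30]},
                {"text": "world", "confidence": 0.95, "bbox": [10, 40, 120, 60]},
            ],
        }
    ]
})

def extract_text(results: dict) -> str:
    """Concatenate line text from every page of every document."""
    lines = []
    for pages in results.values():
        for page in pages:
            for line in page.get("text_lines", []):
                lines.append(line["text"])
    return "\n".join(lines)

results = json.loads(sample_json)
print(extract_text(results))  # Prints "Hello" and "world" on separate lines
```

The same traversal works whether the JSON comes from the CLI's output directory or from results serialized in-process, as long as the per-line fields match.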