# Surya

Surya is a document OCR toolkit for accurate text extraction and document analysis across 90+ languages. Built on PyTorch and modern transformer architectures, it provides OCR, text line detection, document layout analysis, reading order detection, table structure recognition, and LaTeX equation OCR. The toolkit is optimized for document images and PDFs, and supports both CPU and GPU inference with configurable batch processing.

The library uses a modular predictor-based architecture: each task (detection, recognition, layout, table recognition) has its own predictor class that handles model loading, preprocessing, and inference. Surya's foundation model enables high-quality OCR with character-level bounding boxes, math formula recognition, and multi-token prediction for improved performance. Supported input formats include images, PDFs, and folders of documents.

## Text Detection with DetectionPredictor

DetectionPredictor performs line-level text detection in document images, identifying bounding boxes for text regions regardless of language. It uses a semantic segmentation model to generate heatmaps that are post-processed into precise polygon bounding boxes with confidence scores.
```python
from PIL import Image

from surya.detection import DetectionPredictor

# Initialize the detection predictor (downloads model automatically)
det_predictor = DetectionPredictor()

# Load and process a single image
image = Image.open("document.png")
predictions = det_predictor([image])

# Access detection results for the first image
result = predictions[0]
print(f"Found {len(result.bboxes)} text regions")
for bbox in result.bboxes:
    # bbox.polygon contains 4 corner points: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    print(f"Polygon: {bbox.polygon}")
    print(f"Bounding box: {bbox.bbox}")  # [x1, y1, x2, y2] format
    print(f"Confidence: {bbox.confidence:.2f}")

# Process multiple images with a custom batch size
images = [Image.open(f"page_{i}.png") for i in range(10)]
batch_predictions = det_predictor(images, batch_size=8)

# Include heatmaps in the output for debugging
predictions_with_maps = det_predictor([image], include_maps=True)
heatmap = predictions_with_maps[0].heatmap
affinity_map = predictions_with_maps[0].affinity_map
```

## OCR Text Recognition with RecognitionPredictor

RecognitionPredictor performs end-to-end OCR by combining text detection with text recognition. It requires a FoundationPredictor for the recognition model and optionally a DetectionPredictor for automatic line detection. The predictor returns detailed results including text, confidence scores, character-level bounding boxes, and word segmentation.
```python
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.recognition import RecognitionPredictor
from surya.detection import DetectionPredictor

# Initialize predictors
foundation_predictor = FoundationPredictor()
rec_predictor = RecognitionPredictor(foundation_predictor)
det_predictor = DetectionPredictor()

# Full OCR pipeline: detect lines, then recognize text
image = Image.open("document.png")
predictions = rec_predictor(
    [image],
    det_predictor=det_predictor,  # Automatic line detection
    math_mode=True,               # Enable math formula recognition
    return_words=True,            # Include word-level results
    sort_lines=True,              # Sort lines by reading order
)

# Access OCR results
result = predictions[0]
for line in result.text_lines:
    print(f"Text: {line.text}")
    print(f"Confidence: {line.confidence:.2f}")
    print(f"Polygon: {line.polygon}")

    # Character-level details
    for char in line.chars:
        print(f"  Char: '{char.text}' at {char.bbox} (valid: {char.bbox_valid})")

    # Word-level details (when return_words=True)
    for word in line.words:
        print(f"  Word: '{word.text}' at {word.bbox}")

# OCR with pre-defined bounding boxes (skips detection)
bboxes = [[[100, 50, 400, 80], [100, 90, 400, 120]]]  # One list of bboxes per image
predictions = rec_predictor(
    [image],
    bboxes=bboxes,  # Provide your own bounding boxes
)

# OCR with custom polygons
polygons = [[
    [[100, 50], [400, 50], [400, 80], [100, 80]],    # Line 1
    [[100, 90], [400, 90], [400, 120], [100, 120]],  # Line 2
]]
predictions = rec_predictor([image], polygons=polygons)

# Use high-resolution images for better accuracy
highres_image = Image.open("document_highres.png")
lowres_image = highres_image.copy()
lowres_image.thumbnail((1024, 1024))
predictions = rec_predictor(
    [lowres_image],
    det_predictor=det_predictor,
    highres_images=[highres_image],  # Recognition uses the high-res copies
)
```

## Layout Analysis with LayoutPredictor

LayoutPredictor analyzes document structure by detecting and classifying regions such as text blocks, tables,
figures, headers, footers, and more. It also determines the reading order of detected elements, making it essential for document understanding tasks.

```python
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.layout import LayoutPredictor
from surya.settings import settings

# Initialize the layout predictor with its foundation model checkpoint
foundation_predictor = FoundationPredictor(
    checkpoint=settings.LAYOUT_MODEL_CHECKPOINT
)
layout_predictor = LayoutPredictor(foundation_predictor)

# Analyze document layout
image = Image.open("document.png")
predictions = layout_predictor([image])

# Access layout results
result = predictions[0]
for box in result.bboxes:
    print(f"Label: {box.label}")        # e.g., Text, Table, Picture, SectionHeader
    print(f"Position: {box.position}")  # Reading-order position
    print(f"Polygon: {box.polygon}")
    print(f"Confidence: {box.confidence:.2f}")

    # Top-k alternative labels with confidence scores
    if box.top_k:
        print(f"Alternative labels: {box.top_k}")

# Layout labels include:
# - Text, SectionHeader, Caption, Footnote
# - Table, TableOfContents, Form
# - Picture, Figure, Equation, Code
# - PageHeader, PageFooter, ListItem

# Process multiple pages
pages = [Image.open(f"page_{i}.png") for i in range(5)]
all_layouts = layout_predictor(pages, batch_size=4)

# Get top-k label predictions for each element
predictions = layout_predictor([image], top_k=5)
for box in predictions[0].bboxes:
    print(f"{box.label}: {box.top_k}")
```

## Table Recognition with TableRecPredictor

TableRecPredictor extracts detailed table structure including rows, columns, and cells with their positions and spanning information. It detects header rows/columns and cell merging (colspan/rowspan), and provides precise bounding boxes for each table component.
```python
from PIL import Image

from surya.table_rec import TableRecPredictor

# Initialize the table recognition predictor
table_predictor = TableRecPredictor()

# Recognize table structure (the image should be cropped to the table)
table_image = Image.open("table.png")
predictions = table_predictor([table_image])

# Access table structure
result = predictions[0]

# Row information
print(f"Found {len(result.rows)} rows")
for row in result.rows:
    print(f"Row {row.row_id}: bbox={row.bbox}, header={row.is_header}")

# Column information
print(f"Found {len(result.cols)} columns")
for col in result.cols:
    print(f"Column {col.col_id}: bbox={col.bbox}, header={col.is_header}")

# Cell information (after merging)
print(f"Found {len(result.cells)} cells")
for cell in result.cells:
    print(f"Cell ({cell.row_id}, {cell.col_id})")
    print(f"  Bbox: {cell.bbox}")
    print(f"  Colspan: {cell.colspan}, Rowspan: {cell.rowspan}")
    print(f"  Is header: {cell.is_header}")
    print(f"  Merge up: {cell.merge_up}, Merge down: {cell.merge_down}")

# Unmerged cells (before row/column merging)
for cell in result.unmerged_cells:
    print(f"Unmerged cell at ({cell.row_id}, {cell.col_id})")

# Process multiple table images
table_images = [Image.open(f"table_{i}.png") for i in range(3)]
all_tables = table_predictor(table_images, batch_size=4)
```

## Foundation Model Direct Usage

FoundationPredictor is the core model behind text recognition and layout analysis. It can be used directly for fine-grained control over OCR tasks, including different task modes and custom token limits.
```python
import numpy as np
from PIL import Image

from surya.foundation import FoundationPredictor
from surya.common.surya.schema import TaskNames

# Initialize the foundation predictor
predictor = FoundationPredictor()

# Prepare the image as a numpy array
image = Image.open("document.png").convert("RGB")
image_np = np.array(image)

# Run the prediction loop directly with a chosen task type:
# TaskNames.ocr_with_boxes      - OCR with character bounding boxes
# TaskNames.ocr_without_boxes   - OCR without boxes (potentially better text)
# TaskNames.block_without_boxes - Block-level text (paragraphs, equations)
# TaskNames.layout              - Layout analysis
predicted_tokens, batch_bboxes, scores, topk_probs = predictor.prediction_loop(
    images=[image_np],
    input_texts=[""],                      # Optional input text hints
    task_names=[TaskNames.ocr_with_boxes],
    batch_size=32,
    max_tokens=224,                        # Maximum tokens to generate
    math_mode=True,                        # Enable math recognition
    drop_repeated_tokens=True,             # Filter repeated outputs
    max_sliding_window=576,                # Sliding window for long sequences
    tqdm_desc="Processing OCR",
)

# Decode tokens using the processor
for idx, tokens in enumerate(predicted_tokens):
    text = predictor.processor.ocr_tokenizer.decode(
        tokens, task=TaskNames.ocr_with_boxes
    )
    print(f"Image {idx}: {text}")

# Access bounding box predictions
print(f"Bboxes shape: {batch_bboxes.shape}")  # [batch, max_tokens, 6]
```

## Loading Documents from Files and Folders

Surya provides utilities for loading images and PDFs with automatic format detection and page-range support.
```python
from surya.input.load import (
    load_from_file,
    load_from_folder,
    load_pdf,
    load_image,
)
from surya.settings import settings

# Load a single PDF with a specific page range
images, names = load_pdf(
    "document.pdf",
    page_range=[0, 1, 2],    # Pages 0, 1, 2
    dpi=settings.IMAGE_DPI,  # Default 96 DPI
)

# Load high-resolution images for OCR
highres_images, _ = load_pdf(
    "document.pdf",
    dpi=settings.IMAGE_DPI_HIGHRES,  # Default 192 DPI
)

# Load any file (auto-detects PDF vs. image)
images, names = load_from_file("document.pdf", page_range=[0, 5, 10])
images, names = load_from_file("scan.png")

# Load all documents from a folder
images, names = load_from_folder(
    "documents/",
    page_range=None,        # All pages for PDFs
    dpi=settings.IMAGE_DPI,
)

# Load a single image
images, names = load_image("page.jpg")
print(f"Loaded {len(images)} images: {names}")
```

## OCR Error Detection with OCRErrorPredictor

OCRErrorPredictor analyzes OCR output text to detect likely recognition errors, helping identify low-quality results that may need manual review or reprocessing.

```python
from surya.ocr_error import OCRErrorPredictor

# Initialize the error detection predictor
error_predictor = OCRErrorPredictor()

# Analyze OCR text for errors
texts = [
    "This is clean, well-formatted text.",
    "Th1s t3xt h@s m@ny 0CR err0rs",
    "Normal document text with proper formatting.",
    "Garbled t€xt w!th $ymbol$ replacing l3tters",
]
result = error_predictor(texts, batch_size=4)

# Access error detection results
for text, label in zip(result.texts, result.labels):
    print(f"Text: {text[:50]}...")
    print(f"Label: {label}")  # e.g., "clean" or "error"
    print()
```

## Drawing Detection Results on Images

Surya includes debug utilities for visualizing detection results, bounding boxes, and text predictions on images.
```python
from PIL import Image

from surya.detection import DetectionPredictor
from surya.debug.draw import draw_polys_on_image, draw_bboxes_on_image
from surya.debug.text import draw_text_on_image

# Run detection
det_predictor = DetectionPredictor()
image = Image.open("document.png")
predictions = det_predictor([image])

# Draw polygons on the image
result_image = image.copy()
polygons = [bbox.polygon for bbox in predictions[0].bboxes]
labels = [f"{bbox.confidence:.2f}" for bbox in predictions[0].bboxes]
annotated = draw_polys_on_image(
    corners=polygons,
    image=result_image,
    labels=labels,
    label_font_size=12,
    color="red",
)
annotated.save("detection_result.png")

# Draw axis-aligned bounding boxes
bboxes = [bbox.bbox for bbox in predictions[0].bboxes]  # [x1, y1, x2, y2]
annotated = draw_bboxes_on_image(
    bboxes=bboxes,
    image=image.copy(),
    labels=labels,
    color="blue",
)
annotated.save("bbox_result.png")

# Draw OCR text results
from surya.foundation import FoundationPredictor
from surya.recognition import RecognitionPredictor

foundation_predictor = FoundationPredictor()
rec_predictor = RecognitionPredictor(foundation_predictor)
ocr_results = rec_predictor([image], det_predictor=det_predictor)

text_bboxes = [line.bbox for line in ocr_results[0].text_lines]
text_lines = [line.text for line in ocr_results[0].text_lines]
text_image = draw_text_on_image(text_bboxes, text_lines, image.size)
text_image.save("ocr_result.png")
```

## Command Line Interface

Surya provides several CLI commands for processing documents without writing code. Each command supports common options for input paths, output directories, and batch processing.
```bash
# Text detection - outputs bounding boxes to JSON
surya_detect document.pdf --images --output_dir ./results

# OCR - full text recognition
surya_ocr document.pdf --images --output_dir ./results
surya_ocr scan.png --task_name ocr_without_boxes
surya_ocr documents/ --page_range 0,5-10,20

# Layout analysis - document structure detection
surya_layout document.pdf --images --output_dir ./results

# Table recognition - extract table structure
surya_table table.png --images --output_dir ./results
surya_table document.pdf --detect_boxes --skip_table_detection

# LaTeX OCR - recognize equations (requires cropped equation images)
surya_latex_ocr equation.png --output_dir ./results

# Interactive GUI applications
pip install streamlit pdftext
surya_gui  # General OCR GUI

pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
texify_gui  # Equation selection and OCR GUI
```

## Configuration and Environment Variables

Surya's behavior can be customized through environment variables for batch sizes, model checkpoints, and device settings.
```python
import os

# Set environment variables before importing surya so the settings
# object picks them up at import time.

# Set the device (auto-detected by default)
os.environ["TORCH_DEVICE"] = "cuda"  # or "cpu", "mps"

# Batch sizes for the different operations
os.environ["DETECTOR_BATCH_SIZE"] = "36"      # Text detection
os.environ["RECOGNITION_BATCH_SIZE"] = "256"  # OCR recognition
os.environ["LAYOUT_BATCH_SIZE"] = "32"        # Layout analysis
os.environ["TABLE_REC_BATCH_SIZE"] = "64"     # Table recognition

# Detection thresholds
os.environ["DETECTOR_TEXT_THRESHOLD"] = "0.6"    # Text confidence
os.environ["DETECTOR_BLANK_THRESHOLD"] = "0.35"  # Blank space threshold

# Enable model compilation for faster inference
os.environ["COMPILE_DETECTOR"] = "true"
os.environ["COMPILE_LAYOUT"] = "true"
os.environ["COMPILE_TABLE_REC"] = "true"
os.environ["COMPILE_ALL"] = "true"  # Compile all models

# Custom model checkpoints
os.environ["FOUNDATION_MODEL_CHECKPOINT"] = "s3://text_recognition/2025_09_23"
os.environ["DETECTOR_MODEL_CHECKPOINT"] = "s3://text_detection/2025_05_07"

from surya.settings import settings

# Access the current settings
print(f"Device: {settings.TORCH_DEVICE_MODEL}")
print(f"Detection batch size: {settings.DETECTOR_BATCH_SIZE}")
print(f"Model cache directory: {settings.MODEL_CACHE_DIR}")

# Disable progress bars
settings.DISABLE_TQDM = True
```

## Summary

Surya excels at document-centric OCR tasks, including multi-language text recognition, document structure analysis, and table extraction. Its primary use cases include digitizing scanned documents, extracting text from PDFs with complex layouts, processing forms and invoices with table data, and converting printed mathematical equations to LaTeX. The modular predictor architecture allows combining detection, recognition, and layout analysis in flexible pipelines tailored to specific document types. A typical integration loads documents via the input utilities, runs detection to find text regions, performs recognition to extract text, and optionally analyzes layout for document understanding.
For production deployments, choose batch sizes appropriate to your hardware (GPU memory) and enable model compilation for optimal throughput. The toolkit integrates cleanly with downstream document processing pipelines through its JSON output format and Pydantic-based result schemas, which can be easily serialized and processed.
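As a sketch of downstream consumption, the snippet below parses serialized OCR results with only the standard library. The nested structure and field names are illustrative assumptions (modeled on the line-level fields shown earlier: `text`, `confidence`, `bbox`) and may differ between Surya versions, so check the actual JSON your version emits before relying on them.

```python
import json

# Illustrative sample mirroring the assumed shape of serialized OCR results:
# a mapping from document name to a list of per-page results, each carrying
# "text_lines" entries. Field names here are assumptions, not a guaranteed schema.
sample_json = json.dumps({
    "document": [
        {
            "page": 0,
            "text_lines": [
                {"text": "Hello", "confidence": 0.98, "bbox": [10, 10, 120, 30]},
                {"text": "world", "confidence": 0.95, "bbox": [10, 40, 120, 60]},
            ],
        }
    ]
})

def extract_text(results: dict) -> str:
    """Concatenate line text from every page of every document."""
    lines = []
    for pages in results.values():
        for page in pages:
            for line in page.get("text_lines", []):
                lines.append(line["text"])
    return "\n".join(lines)

results = json.loads(sample_json)
print(extract_text(results))  # Prints "Hello" and "world" on separate lines
```

The same traversal works whether the JSON comes from the CLI's output directory or from results serialized in-process, as long as the per-line fields match.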