PassportEye
https://github.com/konstantint/passporteye
# PassportEye

PassportEye is a Python library for extracting and parsing machine-readable zone (MRZ) information from scanned identification documents including passports, visas, and ID cards. The library uses advanced image processing techniques combined with Google Tesseract OCR to detect MRZ regions in arbitrarily positioned documents and extract structured data such as document number, holder's name, date of birth, nationality, and expiration date.

The core processing pipeline uses morphological operations, edge detection, and contour analysis to locate candidate MRZ regions, then applies OCR with automatic error correction optimized for the limited MRZ character set. PassportEye supports all standard ICAO document types (TD1 for ID cards, TD2 for smaller passports, TD3 for standard passports, and MRVA/MRVB for visas) and includes validation via check digit verification. The library provides both a simple Python API and command-line tools for integration into document processing workflows.

## read_mrz - Main MRZ Extraction Function

The primary interface for extracting MRZ data from document images. Takes an image file path or byte stream and returns a parsed MRZ object containing all extracted fields with validation status. Supports JPEG, PNG, and PDF files.

```python
from passporteye import read_mrz

# Basic usage with an image file
mrz = read_mrz('/path/to/passport.jpg')

if mrz is not None:
    # Check if parsing was successful
    if mrz.valid:
        print(f"Document Type: {mrz.type}")
        print(f"Country: {mrz.country}")
        print(f"Document Number: {mrz.number}")
        print(f"Surname: {mrz.surname}")
        print(f"Names: {mrz.names}")
        print(f"Nationality: {mrz.nationality}")
        print(f"Date of Birth: {mrz.date_of_birth}")
        print(f"Sex: {mrz.sex}")
        print(f"Expiration Date: {mrz.expiration_date}")
        print(f"Validation Score: {mrz.valid_score}/100")
    else:
        # Partial parsing - some fields may still be usable
        print(f"Partial match (score: {mrz.valid_score})")
        print(f"Check digits valid: {mrz.valid_check_digits}")
else:
    print("No MRZ detected in image")

# With ROI (Region of Interest) extraction
mrz = read_mrz('/path/to/passport.jpg', save_roi=True)
if mrz is not None and 'roi' in mrz.aux:
    roi_image = mrz.aux['roi']  # numpy ndarray of the detected MRZ region

# Using legacy Tesseract engine (often better results)
mrz = read_mrz('/path/to/passport.jpg', extra_cmdline_params='--oem 0')

# Processing from byte stream
with open('/path/to/passport.jpg', 'rb') as f:
    mrz = read_mrz(f)
```

## MRZ Class - Text Parsing and Validation

Parses MRZ text strings into structured data with full ICAO specification compliance. Supports TD1 (3-line, 30 chars), TD2 (2-line, 36 chars), TD3 (2-line, 44 chars), MRVA and MRVB visa formats. Validates all check digits and provides confidence scoring.

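
To make the check digit validation concrete, here is a minimal hand-written sketch of the ICAO 9303 rule: each character is mapped to a number (digits keep their value, `A`-`Z` map to 10-35, and the filler `<` counts as 0), multiplied by the repeating weights 7, 3, 1, and the sum is reduced modulo 10. The names `char_value` and `check_digit` are hypothetical helpers for illustration only; this is not PassportEye's own implementation, and unlike the library it raises an error on invalid input.

```python
# Illustrative sketch of the ICAO 9303 check digit rule.
# `char_value` and `check_digit` are hypothetical names, not PassportEye APIs.

WEIGHTS = (7, 3, 1)  # weights repeat across the field: 7, 3, 1, 7, 3, 1, ...


def char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A -> 10, B -> 11, ..., Z -> 35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"not a valid MRZ character: {ch!r}")


def check_digit(field: str) -> str:
    """Return the check digit for a field as a one-character string."""
    if not field:
        raise ValueError("empty field")
    total = sum(char_value(ch) * WEIGHTS[i % 3] for i, ch in enumerate(field))
    return str(total % 10)


# Worked example for the date of birth '710909':
# 7*7 + 1*3 + 0*1 + 9*7 + 0*3 + 9*1 = 124, and 124 % 10 = 4
print(check_digit("710909"))     # 4
print(check_digit("10000999<"))  # 6
```

Both results match the check digits embedded in the TD1 sample used in this document (`10000999<` followed by `6`, and `710909` followed by `4`).
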
```python
from passporteye.mrz.text import MRZ

# Parse TD1 ID card (3 lines, 30 characters each)
mrz_td1 = MRZ([
    'IDAUT10000999<6<<<<<<<<<<<<<<<',
    '7109094F1112315AUT<<<<<<<<<<<4',
    'MUSTERFRAU<<ISOLDE<<<<<<<<<<<<'
])

print(f"Type: {mrz_td1.mrz_type}")          # 'TD1'
print(f"Valid: {mrz_td1.valid}")            # True
print(f"Score: {mrz_td1.valid_score}")      # 100
print(f"Number: {mrz_td1.number}")          # '10000999<'
print(f"Name: {mrz_td1.names} {mrz_td1.surname}")  # 'ISOLDE MUSTERFRAU'
print(f"Optional fields: {mrz_td1.optional1}, {mrz_td1.optional2}")

# Parse TD3 passport (2 lines, 44 characters each)
mrz_td3 = MRZ([
    'P<POLKOWALSKA<KWIATKOWSKA<<JOANNA<<<<<<<<<<<',
    'AA00000000POL6002084F1412314<<<<<<<<<<<<<<<4'
])

print(f"Type: {mrz_td3.mrz_type}")          # 'TD3'
print(f"Personal Number: {mrz_td3.personal_number}")

# Parse visa (MRVA format)
mrz_visa = MRZ([
    'VIUSATRAVELER<<HAPPYPERSON<<<<<<<<<<<<<<<<<<',
    '555123ABC6GBR6502056F04122361FLNDDDAM5803085'
])
print(f"Type: {mrz_visa.mrz_type}")         # 'MRVA'

# Convert to dictionary for JSON serialization
data = mrz_td1.to_dict()
# Returns: {'mrz_type': 'TD1', 'valid_score': 100, 'type': 'ID',
#           'country': 'AUT', 'number': '10000999<', ...}

# Validation details
print(f"Check digits valid: {mrz_td1.valid_check_digits}")
# [valid_number, valid_date_of_birth, valid_expiration_date, valid_composite]
print(f"Line lengths valid: {mrz_td1.valid_line_lengths}")
# [True, True, True] for correctly formatted input
```

## MRZ.from_ocr - OCR Output Processing

Creates an MRZ object from raw OCR output text, automatically cleaning up common recognition errors. Handles whitespace removal, line filtering, and character substitution (e.g., '0' for 'O' in numeric fields, '1' for 'I' where appropriate).

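
The three cleanup steps named above (whitespace removal, line filtering, character substitution) can be sketched in plain Python. This is only an illustration of the idea and not the library's actual code: the helper names, the 30-character minimum, the uppercasing step, and the substitution table are all assumptions chosen for this example.

```python
import re
from typing import List

# Hypothetical helpers for illustration; PassportEye's real cleanup may differ.

# A plausible MRZ line contains only upper-case letters, digits and '<'.
_MRZ_LINE = re.compile(r"^[A-Z0-9<]+$")

# Assumed letter-to-digit fixes for fields that must be numeric.
_DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "Z": "2", "S": "5", "B": "8"})


def clean_ocr_lines(text: str, min_length: int = 30) -> List[str]:
    """Strip whitespace inside each line and keep only MRZ-looking lines."""
    cleaned = []
    for raw in text.splitlines():
        line = "".join(raw.split()).upper()  # drop all whitespace, normalise case
        if len(line) >= min_length and _MRZ_LINE.match(line):
            cleaned.append(line)
    return cleaned


def fix_numeric_field(field: str) -> str:
    """Replace letters that OCR commonly confuses with digits."""
    return field.translate(_DIGIT_FIXES)


noisy = "this line useless\nIDAUT10000999<6 " + "<" * 9 + " " + "<" * 6
print(clean_ocr_lines(noisy))       # one 30-character line survives
print(fix_numeric_field("7IO9O9"))  # 710909
```

Applying the digit substitution only to fields that are known to be numeric (dates, check digits) avoids corrupting names, where `O` and `I` are legitimate letters.
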
```python
from passporteye.mrz.text import MRZ

# Raw OCR output with typical errors and noise
ocr_text = '''
this line useless
IDAUT10000999<6 <<<<<<<<< <<<<<<
7IO9O94FIi iz3iSAUT<<<<<<<<<<<4
MUSTERFRA U<<ISOLDE<<< <<<<<<<<<
'''

# Parse with automatic OCR cleanup
mrz = MRZ.from_ocr(ocr_text)

print(f"Valid: {mrz.valid}")      # True (errors corrected)
print(f"Names: {mrz.names}")      # 'ISOLDE'
print(f"Surname: {mrz.surname}")  # 'MUSTERFRAU'

# Access raw OCR text
print(f"Raw text: {mrz.aux['raw_text']}")
```

## MRZPipeline - Advanced Processing Pipeline

Provides full control over the MRZ extraction pipeline with access to all intermediate processing steps. Useful for debugging, visualization, and custom processing workflows.

```python
from passporteye.mrz.image import MRZPipeline
import matplotlib.pyplot as plt

# Create pipeline for an image
pipeline = MRZPipeline('/path/to/passport.jpg', extra_cmdline_params='--oem 0')

# Access the final MRZ result
mrz = pipeline.result  # Same as pipeline['mrz_final']

# Access intermediate processing results
img = pipeline['img']                # Original grayscale image
img_small = pipeline['img_small']    # Scaled down image
img_binary = pipeline['img_binary']  # Binarized image for region detection
boxes = pipeline['boxes']            # Detected candidate MRZ regions (RotatedBox objects)
roi = pipeline['roi']                # Region of interest used for OCR
text = pipeline['text']              # Raw OCR text output

# Visualize the detection process
plt.figure(figsize=(12, 8))
plt.imshow(pipeline['img_binary'], cmap='gray')
for box in pipeline['boxes']:
    plt.plot(box.points[:, 1], box.points[:, 0], 'b-', linewidth=2)
plt.title('Detected MRZ Candidate Regions')
plt.show()

# Extract all potential MRZ regions as images
rois = pipeline['rois']  # List of numpy arrays
for i, region in enumerate(rois):
    plt.imsave(f'region_{i}.png', region, cmap='gray')

# Access all data computed by the pipeline
print(f"Scale factor: {pipeline['scale_factor']}")
print(f"Box index used: {pipeline['box_idx']}")
```

## MRZCheckDigit - Check Digit Computation

Utility class implementing the ICAO standard check digit algorithm for MRZ validation. Computes check digits using the weighted modulo-10 algorithm with weights [7, 3, 1].

```python
from passporteye.mrz.text import MRZCheckDigit

# Compute check digit for a document number
doc_number = '10000999<'
check = MRZCheckDigit.compute(doc_number)
print(f"Check digit for {doc_number}: {check}")  # '6'

# Validate a field with its check digit
date_of_birth = '710909'
expected_check = '4'
computed_check = MRZCheckDigit.compute(date_of_birth)
is_valid = computed_check == expected_check
print(f"Date of birth valid: {is_valid}")  # True

# Check digit computation rules
print(MRZCheckDigit.compute('0'))          # '0'
print(MRZCheckDigit.compute('111111111'))  # '3'
print(MRZCheckDigit.compute('BCDEFGHIJ'))  # Same as '123456789'
print(MRZCheckDigit.compute(''))           # '' (empty for invalid input)
```

## Command-Line Tool: mrz

Standalone command for extracting MRZ from image files. Outputs results in tabular or JSON format. Supports PDF files by extracting the first embedded image.

```bash
# Basic usage - outputs tabular format
mrz /path/to/passport.jpg

# Output example:
# mrz_type         TD3
# valid_score      100
# type             P<
# country          POL
# number           AA0000000
# date_of_birth    600208
# expiration_date  141231
# nationality      POL
# sex              F
# names            JOANNA
# surname          KOWALSKA KWIATKOWSKA
# ...

# JSON output for programmatic use
mrz --json /path/to/passport.jpg
# {"mrz_type": "TD3", "valid_score": 100, "type": "P<", ...}

# Use legacy Tesseract engine (often better accuracy)
mrz --legacy /path/to/passport.jpg

# Save the detected MRZ region as an image
mrz --save-roi mrz_region.png /path/to/passport.jpg

# Process a PDF document
mrz /path/to/scanned_document.pdf

# Show version
mrz --version
```

## Command-Line Tool: extract_mrz_rois

Extracts all candidate MRZ regions from an image as separate PNG files for manual inspection or batch processing.

```bash
# Extract regions to current directory (creates 1.png, 2.png, etc.)
extract_mrz_rois /path/to/passport.jpg

# Extract to a specific directory
extract_mrz_rois -d /output/rois/ /path/to/passport.jpg

# Create output directory if it doesn't exist
extract_mrz_rois -d /output/rois/ -c /path/to/passport.jpg

# Show version
extract_mrz_rois --version
```

## Command-Line Tool: evaluate_mrz

Batch evaluation tool for testing the MRZ recognition pipeline on multiple images. Reports accuracy statistics and allows sorting files by recognition success.

```bash
# Run on default test data with 4 parallel workers
evaluate_mrz -j 4

# Run on a custom directory of images
evaluate_mrz --data-dir /path/to/test/images -j 4

# Limit to first 100 files
evaluate_mrz --data-dir /path/to/images --limit 100

# Sort results into success/failure directories
evaluate_mrz --success-dir /output/success --fail-dir /output/fail

# Extract ROIs for all processed images
evaluate_mrz --roi-dir /output/rois

# Use legacy Tesseract engine
evaluate_mrz --legacy -j 4

# Output example:
# Walltime: 45.23s
# Compute walltime: 120.56s
# Processed files: 50
# Perfect parses: 40
# Invalid parses: 5
# Total score: 4250
# Mean score: 85.00
# Mean compute time: 2.41s
```

## ocr - Low-Level OCR Function

Direct interface to Tesseract OCR optimized for MRZ recognition. Accepts numpy arrays and returns raw text output with MRZ-specific configuration.

```python
from passporteye.util.ocr import ocr
from skimage import io

# Load an image region
roi = io.imread('/path/to/mrz_region.png', as_gray=True)

# Run OCR with MRZ optimization (default)
text = ocr(roi)
print(text)  # Raw MRZ text lines

# Run OCR without MRZ-specific settings
text = ocr(roi, mrz_mode=False)

# Use legacy Tesseract engine
text = ocr(roi, extra_cmdline_params='--oem 0')

# Custom Tesseract configuration
text = ocr(roi, mrz_mode=False, extra_cmdline_params='--psm 6 -l eng')
```

## Summary

PassportEye is designed for automated document verification systems where extracting identity information from scanned or photographed travel documents is required. Primary use cases include border control automation, KYC (Know Your Customer) verification in financial services, hotel check-in systems, and any application requiring bulk processing of identity documents. The library handles documents at various angles, with different lighting conditions, and from different source qualities, making it suitable for real-world deployment.

Integration is straightforward through the main `read_mrz()` function for most applications, while the `MRZPipeline` class offers extensibility for custom workflows. The command-line tools enable quick testing and batch processing without code. For production systems, the `to_dict()` method facilitates JSON serialization for API responses, and the validation scoring system allows implementing confidence thresholds. The library requires Tesseract OCR to be installed and accessible in the system PATH, with optional "legacy" model support for improved accuracy on MRZ text.
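
As one possible way to apply the confidence thresholds mentioned above, the sketch below works on a plain dictionary shaped like the output of `to_dict()` (only the `valid_score` key is assumed). The function names, the decision labels, and the threshold values 100 and 50 are hypothetical and would need tuning for a real deployment.

```python
import json
from typing import Any, Dict

# Hypothetical triage helpers; thresholds and labels are illustrative assumptions.


def triage(fields: Dict[str, Any], accept_at: int = 100, review_at: int = 50) -> str:
    """Classify a parsed MRZ dictionary by its validation score."""
    score = fields.get("valid_score", 0)  # a missing score is treated as 0
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "review"  # e.g. route to a human operator
    return "reject"


def to_response(fields: Dict[str, Any]) -> str:
    """Build a JSON body combining the decision with the extracted fields."""
    return json.dumps({"decision": triage(fields), "mrz": fields}, sort_keys=True)


# Example with a dictionary shaped like the to_dict() output shown earlier
sample = {"mrz_type": "TD1", "valid_score": 100, "country": "AUT"}
print(to_response(sample))
```

In practice the dictionary would come from `mrz.to_dict()` after a successful `read_mrz()` call, with a `None` result handled before triage.
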