# PassportEye

PassportEye is a Python library for extracting and parsing machine-readable zone (MRZ) information from scanned identification documents such as passports, visas, and ID cards. It combines image processing with Tesseract OCR to detect the MRZ in arbitrarily positioned documents and extract structured data such as the document number, holder's name, date of birth, nationality, and expiration date.

The core processing pipeline uses morphological operations, edge detection, and contour analysis to locate candidate MRZ regions, then applies OCR with automatic error correction tuned to the limited MRZ character set. PassportEye supports the standard ICAO MRZ formats (TD1 for ID cards, TD2 for ID-2 size documents such as some ID cards and older travel documents, TD3 for passport booklets, and MRVA/MRVB for visas) and validates results by verifying check digits. The library provides both a simple Python API and command-line tools for integration into document processing workflows.

## read_mrz - Main MRZ Extraction Function

The primary interface for extracting MRZ data from document images. It takes an image file path or byte stream and returns a parsed MRZ object containing the extracted fields and their validation status, or `None` if no MRZ is detected. JPEG, PNG, and PDF inputs are supported.

```python
from passporteye import read_mrz

# Basic usage with an image file
mrz = read_mrz('/path/to/passport.jpg')

if mrz is not None:
    # Check if parsing was successful
    if mrz.valid:
        print(f"Document Type: {mrz.type}")
        print(f"Country: {mrz.country}")
        print(f"Document Number: {mrz.number}")
        print(f"Surname: {mrz.surname}")
        print(f"Names: {mrz.names}")
        print(f"Nationality: {mrz.nationality}")
        print(f"Date of Birth: {mrz.date_of_birth}")
        print(f"Sex: {mrz.sex}")
        print(f"Expiration Date: {mrz.expiration_date}")
        print(f"Validation Score: {mrz.valid_score}/100")
    else:
        # Partial parsing - some fields may still be usable
        print(f"Partial match (score: {mrz.valid_score})")
        print(f"Check digits valid: {mrz.valid_check_digits}")
else:
    print("No MRZ detected in image")

# With ROI (Region of Interest) extraction
mrz = read_mrz('/path/to/passport.jpg', save_roi=True)
if mrz is not None and 'roi' in mrz.aux:
    roi_image = mrz.aux['roi']  # numpy ndarray of the detected MRZ region

# Use the legacy Tesseract engine (often better results; needs legacy model data installed)
mrz = read_mrz('/path/to/passport.jpg', extra_cmdline_params='--oem 0')

# Processing from byte stream
with open('/path/to/passport.jpg', 'rb') as f:
    mrz = read_mrz(f)
```
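Date fields such as `date_of_birth` and `expiration_date` are six-character `YYMMDD` strings (for example `600208` in the command-line output shown later), so callers usually need to convert them. The following is a minimal sketch of one way to do that. The helper name `parse_mrz_date` and the `pivot_year` windowing rule are assumptions made for illustration; the MRZ itself does not encode a century, and PassportEye does not prescribe one.

```python
from datetime import date
from typing import Optional


def parse_mrz_date(yymmdd: str, pivot_year: int) -> Optional[date]:
    """Convert a YYMMDD string to a date, or return None if it is malformed.

    The two-digit year is placed in the 100-year window ending at
    `pivot_year` (an assumed convention, not part of the MRZ format).
    """
    if len(yymmdd) != 6 or any(c not in '0123456789' for c in yymmdd):
        return None
    yy, mm, dd = int(yymmdd[:2]), int(yymmdd[2:4]), int(yymmdd[4:])
    year = pivot_year - (pivot_year - yy) % 100
    try:
        return date(year, mm, dd)
    except ValueError:  # impossible calendar date, e.g. month 13
        return None


# Birth dates cannot lie in the future, so the current year is a natural pivot.
print(parse_mrz_date('600208', pivot_year=2024))  # 1960-02-08
```

For expiration dates a pivot some years in the future is the more sensible choice, since a document may expire after the current year.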

## MRZ Class - Text Parsing and Validation

Parses MRZ text strings into structured data following the ICAO fixed-width layouts. Supports TD1 (3 lines of 30 characters), TD2 (2 lines of 36 characters), TD3 (2 lines of 44 characters), and the MRVA and MRVB visa formats. Validates all check digits and provides a confidence score.

```python
from passporteye.mrz.text import MRZ

# Parse TD1 ID card (3 lines, 30 characters each)
mrz_td1 = MRZ([
    'IDAUT10000999<6<<<<<<<<<<<<<<<',
    '7109094F1112315AUT<<<<<<<<<<<4',
    'MUSTERFRAU<<ISOLDE<<<<<<<<<<<<'
])
print(f"Type: {mrz_td1.mrz_type}")  # 'TD1'
print(f"Valid: {mrz_td1.valid}")     # True
print(f"Score: {mrz_td1.valid_score}")  # 100
print(f"Number: {mrz_td1.number}")   # '10000999<'
print(f"Name: {mrz_td1.names} {mrz_td1.surname}")  # 'ISOLDE MUSTERFRAU'
print(f"Optional fields: {mrz_td1.optional1}, {mrz_td1.optional2}")

# Parse TD3 passport (2 lines, 44 characters each)
mrz_td3 = MRZ([
    'P<POLKOWALSKA<KWIATKOWSKA<<JOANNA<<<<<<<<<<<',
    'AA00000000POL6002084F1412314<<<<<<<<<<<<<<<4'
])
print(f"Type: {mrz_td3.mrz_type}")  # 'TD3'
print(f"Personal Number: {mrz_td3.personal_number}")

# Parse visa (MRVA format)
mrz_visa = MRZ([
    'VIUSATRAVELER<<HAPPYPERSON<<<<<<<<<<<<<<<<<<',
    '555123ABC6GBR6502056F04122361FLNDDDAM5803085'
])
print(f"Type: {mrz_visa.mrz_type}")  # 'MRVA'

# Convert to dictionary for JSON serialization
data = mrz_td1.to_dict()
# Returns: {'mrz_type': 'TD1', 'valid_score': 100, 'type': 'ID',
#           'country': 'AUT', 'number': '10000999<', ...}

# Validation details
print(f"Check digits valid: {mrz_td1.valid_check_digits}")
# [valid_number, valid_date_of_birth, valid_expiration_date, valid_composite]
print(f"Line lengths valid: {mrz_td1.valid_line_lengths}")
# [True, True, True] for correctly formatted input
```
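Parsing works because every MRZ format is fixed-width: each field occupies known character positions. As an illustration only, the sketch below slices the second line of a TD3 MRZ by position, using the field offsets from the ICAO Doc 9303 TD3 layout. The function name `split_td3_line2` and the dictionary keys are hypothetical and are not part of PassportEye's API.

```python
def split_td3_line2(line: str) -> dict:
    """Slice the second line of a TD3 MRZ into its fixed-width fields."""
    if len(line) != 44:
        raise ValueError(f"TD3 lines are 44 characters long, got {len(line)}")
    return {
        'number': line[0:9],
        'check_number': line[9],
        'nationality': line[10:13],
        'date_of_birth': line[13:19],
        'check_date_of_birth': line[19],
        'sex': line[20],
        'expiration_date': line[21:27],
        'check_expiration_date': line[27],
        'personal_number': line[28:42],
        'check_personal_number': line[42],
        'check_composite': line[43],
    }


fields = split_td3_line2('AA00000000POL6002084F1412314<<<<<<<<<<<<<<<4')
print(fields['number'], fields['nationality'], fields['date_of_birth'])
# AA0000000 POL 600208
```

A real parser additionally verifies each check digit and tolerates lines of the wrong length, which is what the `valid_check_digits` and `valid_line_lengths` attributes report.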

## MRZ.from_ocr - OCR Output Processing

Creates an MRZ object from raw OCR output text, automatically cleaning up common recognition errors. Handles whitespace removal, line filtering, and character substitution (e.g., '0' for 'O' in numeric fields, '1' for 'I' where appropriate).

```python
from passporteye.mrz.text import MRZ

# Raw OCR output with typical errors and noise
ocr_text = '''
   this line useless
   IDAUT10000999<6  <<<<<<<<< <<<<<<
   7IO9O94FIi  iz3iSAUT<<<<<<<<<<<4
   MUSTERFRA  U<<ISOLDE<<<  <<<<<<<<<
'''

# Parse with automatic OCR cleanup
mrz = MRZ.from_ocr(ocr_text)

print(f"Valid: {mrz.valid}")  # True (errors corrected)
print(f"Names: {mrz.names}")  # 'ISOLDE'
print(f"Surname: {mrz.surname}")  # 'MUSTERFRAU'

# Access raw OCR text
print(f"Raw text: {mrz.aux['raw_text']}")
```
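To give a feel for the kind of cleanup involved, here is a simplified sketch of digit look-alike substitution for a field that is known to be numeric. It is not PassportEye's actual cleanup code: the function name `fix_numeric_field` is hypothetical, and the substitution table is an assumption chosen for illustration.

```python
# Letters that OCR commonly confuses with digits. This table is an
# illustrative assumption, not the mapping PassportEye uses internally.
_DIGIT_LOOKALIKES = str.maketrans({
    'O': '0', 'Q': '0', 'D': '0',
    'I': '1', 'L': '1',
    'Z': '2',
    'S': '5',
    'B': '8',
})


def fix_numeric_field(text: str) -> str:
    """Drop whitespace, uppercase, and map digit look-alikes to digits."""
    compact = ''.join(text.split()).upper()
    return compact.translate(_DIGIT_LOOKALIKES)


# Date fragments taken from the noisy OCR line in the example above
print(fix_numeric_field('7IO9O9'))    # 710909
print(fix_numeric_field('Ii  iz3i'))  # 111231
```

Substitution like this is only safe on positions that must contain digits (dates and check digits); applying it to name or country fields would corrupt legitimate letters.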

## MRZPipeline - Advanced Processing Pipeline

Provides full control over the MRZ extraction pipeline with access to all intermediate processing steps. Useful for debugging, visualization, and custom processing workflows.

```python
from passporteye.mrz.image import MRZPipeline
import matplotlib.pyplot as plt

# Create pipeline for an image
pipeline = MRZPipeline('/path/to/passport.jpg', extra_cmdline_params='--oem 0')

# Access the final MRZ result
mrz = pipeline.result  # Same as pipeline['mrz_final']

# Access intermediate processing results
img = pipeline['img']           # Original grayscale image
img_small = pipeline['img_small']  # Scaled down image
img_binary = pipeline['img_binary']  # Binarized image for region detection
boxes = pipeline['boxes']       # Detected candidate MRZ regions (RotatedBox objects)
roi = pipeline['roi']           # Region of interest used for OCR
text = pipeline['text']         # Raw OCR text output

# Visualize the detection process
plt.figure(figsize=(12, 8))
plt.imshow(pipeline['img_binary'], cmap='gray')
for box in pipeline['boxes']:
    plt.plot(box.points[:, 1], box.points[:, 0], 'b-', linewidth=2)
plt.title('Detected MRZ Candidate Regions')
plt.show()

# Extract all potential MRZ regions as images
rois = pipeline['rois']  # List of numpy arrays
for i, region in enumerate(rois):
    plt.imsave(f'region_{i}.png', region, cmap='gray')

# Access all data computed by the pipeline
print(f"Scale factor: {pipeline['scale_factor']}")
print(f"Box index used: {pipeline['box_idx']}")
```

## MRZCheckDigit - Check Digit Computation

Utility class implementing the ICAO standard check digit algorithm for MRZ validation. Computes check digits using the weighted modulo-10 algorithm with weights [7, 3, 1].

```python
from passporteye.mrz.text import MRZCheckDigit

# Compute check digit for a document number
doc_number = '10000999<'
check = MRZCheckDigit.compute(doc_number)
print(f"Check digit for {doc_number}: {check}")  # '6'

# Validate a field with its check digit
date_of_birth = '710909'
expected_check = '4'
computed_check = MRZCheckDigit.compute(date_of_birth)
is_valid = computed_check == expected_check
print(f"Date of birth valid: {is_valid}")  # True

# Check digit computation rules
print(MRZCheckDigit.compute('0'))           # '0'
print(MRZCheckDigit.compute('111111111'))   # '3'
print(MRZCheckDigit.compute('BCDEFGHIJ'))   # Same check digit as '123456789' (letters count as 10 + position)
print(MRZCheckDigit.compute(''))            # '' (empty string for empty or invalid input)
```
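To make the weighting scheme concrete, here is a minimal standalone sketch of the 7-3-1 computation. It illustrates the algorithm rather than reproducing PassportEye's implementation: the function name `icao_check_digit` is hypothetical, and returning an empty string for empty or unrecognized input is an assumption chosen to match the behaviour shown in the examples above.

```python
import string

# Character values per ICAO Doc 9303: digits map to themselves,
# letters A-Z map to 10-35, and the filler '<' counts as 0.
_VALUES = {str(d): d for d in range(10)}
_VALUES.update({c: i + 10 for i, c in enumerate(string.ascii_uppercase)})
_VALUES['<'] = 0

_WEIGHTS = (7, 3, 1)


def icao_check_digit(field: str) -> str:
    """Return the check digit for `field`, or '' if it cannot be computed."""
    if not field or any(c not in _VALUES for c in field):
        return ''
    # Multiply each character value by the repeating weights 7, 3, 1
    # and keep the last digit of the sum.
    total = sum(_VALUES[c] * _WEIGHTS[i % 3] for i, c in enumerate(field))
    return str(total % 10)


print(icao_check_digit('10000999<'))  # 6
print(icao_check_digit('710909'))     # 4
```

For `'710909'` the weighted sum is 7·7 + 1·3 + 0·1 + 9·7 + 0·3 + 9·1 = 124, so the check digit is 4.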

## Command-Line Tool: mrz

Standalone command for extracting MRZ from image files. Outputs results in tabular or JSON format. Supports PDF files by extracting the first embedded image.

```bash
# Basic usage - outputs tabular format
mrz /path/to/passport.jpg

# Output example:
# mrz_type    TD3
# valid_score 100
# type        P<
# country     POL
# number      AA0000000
# date_of_birth   600208
# expiration_date 141231
# nationality POL
# sex         F
# names       JOANNA
# surname     KOWALSKA KWIATKOWSKA
# ...

# JSON output for programmatic use
mrz --json /path/to/passport.jpg
# {"mrz_type": "TD3", "valid_score": 100, "type": "P<", ...}

# Use legacy Tesseract engine (often better accuracy)
mrz --legacy /path/to/passport.jpg

# Save the detected MRZ region as an image
mrz --save-roi mrz_region.png /path/to/passport.jpg

# Process a PDF document
mrz /path/to/scanned_document.pdf

# Show version
mrz --version
```

## Command-Line Tool: extract_mrz_rois

Extracts all candidate MRZ regions from an image as separate PNG files for manual inspection or batch processing.

```bash
# Extract regions to current directory (creates 1.png, 2.png, etc.)
extract_mrz_rois /path/to/passport.jpg

# Extract to a specific directory
extract_mrz_rois -d /output/rois/ /path/to/passport.jpg

# Create output directory if it doesn't exist
extract_mrz_rois -d /output/rois/ -c /path/to/passport.jpg

# Show version
extract_mrz_rois --version
```

## Command-Line Tool: evaluate_mrz

Batch evaluation tool for testing the MRZ recognition pipeline on multiple images. Reports accuracy statistics and allows sorting files by recognition success.

```bash
# Run on default test data with 4 parallel workers
evaluate_mrz -j 4

# Run on a custom directory of images
evaluate_mrz --data-dir /path/to/test/images -j 4

# Limit to first 100 files
evaluate_mrz --data-dir /path/to/images --limit 100

# Sort results into success/failure directories
evaluate_mrz --success-dir /output/success --fail-dir /output/fail

# Extract ROIs for all processed images
evaluate_mrz --roi-dir /output/rois

# Use legacy Tesseract engine
evaluate_mrz --legacy -j 4

# Output example:
# Walltime:          45.23s
# Compute walltime:  120.56s
# Processed files:   50
# Perfect parses:    40
# Invalid parses:    5
# Total score:       4250
# Mean score:        85.00
# Mean compute time: 2.41s
```

## ocr - Low-Level OCR Function

Direct interface to Tesseract OCR optimized for MRZ recognition. Accepts numpy arrays and returns raw text output with MRZ-specific configuration.

```python
from passporteye.util.ocr import ocr
from skimage import io

# Load an image region
roi = io.imread('/path/to/mrz_region.png', as_gray=True)

# Run OCR with MRZ optimization (default)
text = ocr(roi)
print(text)  # Raw MRZ text lines

# Run OCR without MRZ-specific settings
text = ocr(roi, mrz_mode=False)

# Use legacy Tesseract engine
text = ocr(roi, extra_cmdline_params='--oem 0')

# Custom Tesseract configuration
text = ocr(roi, mrz_mode=False, extra_cmdline_params='--psm 6 -l eng')
```

## Summary

PassportEye is designed for automated document verification systems that need to extract identity information from scanned or photographed travel documents. Typical use cases include border control automation, KYC (Know Your Customer) verification in financial services, hotel check-in systems, and other applications that process identity documents in bulk. The detection pipeline is built to cope with rotated documents and varying image quality, but recognition accuracy still depends on the input, so callers should inspect `valid_score` and the check digit results rather than assume every parse is correct.

Integration is straightforward through the main `read_mrz()` function for most applications, while the `MRZPipeline` class offers extensibility for custom workflows. The command-line tools enable quick testing and batch processing without code. For production systems, the `to_dict()` method facilitates JSON serialization for API responses, and the validation scoring system allows implementing confidence thresholds. The library requires Tesseract OCR to be installed and accessible in the system PATH, with optional "legacy" model support for improved accuracy on MRZ text.
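As a rough sketch of how a confidence threshold might be combined with JSON serialization, the function below works on a plain dictionary shaped like the output of `to_dict()`. The function name `mrz_response`, the response layout, and the default threshold of 80 are all assumptions made for illustration; none of them come from PassportEye.

```python
import json


def mrz_response(fields: dict, min_score: int = 80) -> str:
    """Build a JSON response from a dict shaped like `to_dict()` output.

    `min_score` is an arbitrary example threshold, not a library default.
    """
    score = fields.get('valid_score', 0)
    accepted = score >= min_score
    payload = {
        'accepted': accepted,
        'score': score,
        # Withhold the parsed fields when the score is below the threshold.
        'fields': fields if accepted else None,
    }
    return json.dumps(payload)


# A hand-written dict standing in for mrz.to_dict()
sample = {'mrz_type': 'TD3', 'valid_score': 100, 'country': 'POL'}
print(mrz_response(sample))
```

The right threshold depends on the application: a system with manual review downstream can accept lower scores than one that acts on the data automatically.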