https://github.com/pikepdf/pikepdf
# pikepdf

pikepdf is a Python library for reading, writing, repairing, and transforming PDF files. Built on top of the mature C++ [qpdf](https://github.com/qpdf/qpdf) library, pikepdf provides a Pythonic interface for PDF manipulation with automatic repair of damaged PDFs, full encryption support (AES-256, AES-128, RC4), XMP metadata editing, linearization for web optimization, and lossless image extraction. Unlike pure Python PDF libraries, pikepdf leverages qpdf's battle-tested capabilities to handle even malformed or corrupted PDF files.

The library excels at document assembly tasks (merge, split, rotate, rearrange pages), low-level PDF object manipulation, metadata management, and working with encrypted PDFs. It provides dictionary-style access to PDF objects that mirrors the PDF specification, making it ideal for developers who need precise control over PDF internals. pikepdf is used by projects like OCRmyPDF and PDF Arranger, and is available as pre-built binary wheels for Linux, macOS, and Windows on both x86-64 and ARM64 architectures.

## Opening and Reading PDFs

Open PDF files for reading and inspection, with automatic repair of structural damage. The `Pdf.open()` method accepts file paths, pathlib.Path objects, or file-like streams.

```python
import pikepdf
from pikepdf import Pdf

# Open a PDF file using context manager (recommended)
with Pdf.open('document.pdf') as pdf:
    print(f"Number of pages: {len(pdf.pages)}")
    print(f"PDF version: {pdf.pdf_version}")
    print(f"Is encrypted: {pdf.is_encrypted}")

    # Access the document catalog (root object)
    print(f"Root keys: {list(pdf.Root.keys())}")

# Open an encrypted PDF with password
with Pdf.open('protected.pdf', password='secretpassword') as pdf:
    print(f"Owner password matched: {pdf.owner_password_matched}")
    print(f"User password matched: {pdf.user_password_matched}")

# Open from bytes or a file-like object
from io import BytesIO
from pathlib import Path
pdf_bytes = Path('document.pdf').read_bytes()  # reads and closes the file
with Pdf.open(BytesIO(pdf_bytes)) as pdf:
    print(f"Pages: {len(pdf.pages)}")
```

## Creating New PDFs

Create new PDF documents from scratch using `Pdf.new()`. While pikepdf is primarily designed for manipulating existing PDFs, it can create simple documents.

```python
from pikepdf import Pdf

# Create a new empty PDF
pdf = Pdf.new()

# Add blank pages with default letter size (612x792 points)
pdf.add_blank_page(page_size=(612, 792))
pdf.add_blank_page(page_size=(595, 842))  # A4 size

# Save the new PDF
pdf.save('blank_document.pdf')
pdf.close()

# Using context manager
with Pdf.new() as pdf:
    pdf.add_blank_page()
    pdf.save('new_blank.pdf')
```
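
The page sizes above are given in PDF points, where one point is 1/72 inch. As a minimal sketch of where numbers like 595 and 842 come from, the following converts millimetres to points. The helper name `mm_to_points` is hypothetical and not part of pikepdf.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A4 paper is 210 mm x 297 mm; rounding gives whole-point dimensions.
a4_size = (round(mm_to_points(210)), round(mm_to_points(297)))
```

The resulting tuple has the same shape as the `page_size` argument used above.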

## Saving PDFs

Save modified PDFs with various options including encryption, linearization, and compression settings.

```python
import pikepdf
from pikepdf import Pdf, Encryption, Permissions, ObjectStreamMode

with Pdf.open('input.pdf') as pdf:
    # Basic save
    pdf.save('output.pdf')

    # Save with linearization (fast web view)
    pdf.save('web_optimized.pdf', linearize=True)

    # Save with object stream compression for smaller files
    pdf.save('compressed.pdf', object_stream_mode=ObjectStreamMode.generate)

    # Save to a file-like object
    from io import BytesIO
    buffer = BytesIO()
    pdf.save(buffer)
    pdf_bytes = buffer.getvalue()

# Save with encryption
with Pdf.open('input.pdf') as pdf:
    # AES-256 encryption (default, strongest)
    pdf.save('encrypted.pdf', encryption=Encryption(
        user='userpassword',
        owner='ownerpassword'
    ))

    # Encryption with restricted permissions
    no_print = Permissions(print_lowres=False, print_highres=False)
    pdf.save('no_print.pdf', encryption=Encryption(
        user='',  # Empty user password allows anyone to open
        owner='adminpass',
        allow=no_print
    ))

# Remove encryption (if user password is empty or known)
with Pdf.open('encrypted.pdf', password='userpassword') as pdf:
    pdf.save('decrypted.pdf', encryption=False)
```

## Page Manipulation

Access and manipulate PDF pages using list-like operations. Pages are zero-indexed and support slicing, insertion, deletion, and reordering.

```python
from pikepdf import Pdf, Page

# Access pages
with Pdf.open('document.pdf') as pdf:
    # Get page count
    num_pages = len(pdf.pages)

    # Access individual pages (zero-indexed)
    first_page = pdf.pages[0]
    last_page = pdf.pages[-1]

    # Access using counting numbers (1-indexed)
    first_page = pdf.pages.p(1)  # Same as pdf.pages[0]

    # Get page properties
    page = pdf.pages[0]
    print(f"MediaBox: {page.MediaBox}")
    print(f"TrimBox: {page.trimbox}")
    print(f"Page label: {page.label}")

    # Rotate pages
    page.rotate(90, relative=True)   # Rotate 90 degrees clockwise
    page.rotate(180, relative=False) # Set absolute rotation to 180

    # Delete pages (remaining pages shift down after each deletion)
    del pdf.pages[0]        # Delete first page
    del pdf.pages[-1]       # Delete last page
    del pdf.pages[1:3]      # Delete the pages now at indices 1 and 2

    # Reverse page order
    pdf.pages.reverse()

    pdf.save('modified.pdf')
```
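
Because deleting a page shifts every later index, removing several pages by their original numbers is easiest when done from the end backwards. The sketch below shows the idea on a plain list standing in for `pdf.pages`. The helper `delete_pages` is hypothetical, and it assumes the collection supports `len()` and `del pages[i]` as a list does.

```python
from typing import Iterable, MutableSequence


def delete_pages(pages: MutableSequence, page_numbers: Iterable[int]) -> None:
    """Delete 1-based page numbers from a list-like page collection in place."""
    # Unique zero-based indices, highest first, so that earlier deletions
    # never shift the indices that are still waiting to be removed.
    indices = sorted({n - 1 for n in page_numbers}, reverse=True)
    if indices and (indices[-1] < 0 or indices[0] >= len(pages)):
        raise ValueError("page number out of range")
    for index in indices:
        del pages[index]


# Demonstrated on a plain list standing in for pdf.pages.
demo = ['p1', 'p2', 'p3', 'p4', 'p5']
delete_pages(demo, [1, 5, 3])
```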

## Merging PDFs

Combine multiple PDF documents into a single file by extending the pages collection.

```python
from pikepdf import Pdf
from glob import glob

# Basic merge
with Pdf.new() as merged:
    for filename in ['first.pdf', 'second.pdf', 'third.pdf']:
        with Pdf.open(filename) as src:
            merged.pages.extend(src.pages)
    merged.save('merged.pdf')

# Advanced merge with version tracking
with Pdf.new() as merged:
    version = merged.pdf_version

    for file in sorted(glob('*.pdf')):
        with Pdf.open(file) as src:
            version = max(version, src.pdf_version)
            merged.pages.extend(src.pages)

    # Clean up unreferenced resources
    merged.remove_unreferenced_resources()

    # Use the highest version seen among the sources as the output's minimum version
    merged.save('merged.pdf', min_version=version)

# Interleave pages from two PDFs (odd/even merge)
with Pdf.open('odd_pages.pdf') as odd, Pdf.open('even_pages.pdf') as even:
    with Pdf.new() as merged:
        for i in range(max(len(odd.pages), len(even.pages))):
            if i < len(odd.pages):
                merged.pages.append(odd.pages[i])
            if i < len(even.pages):
                merged.pages.append(even.pages[i])
        merged.save('interleaved.pdf')
```
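
The interleaving loop can also be written with `itertools.zip_longest`, which takes care of inputs of unequal length. This is a sketch on plain lists. The generator name `interleave` is hypothetical, and it assumes only that both inputs are iterable.

```python
from itertools import zip_longest
from typing import Iterable, Iterator

_MISSING = object()  # sentinel marking the end of the shorter input


def interleave(odd: Iterable, even: Iterable) -> Iterator:
    """Yield items alternately from odd and even, keeping any leftover tail."""
    for pair in zip_longest(odd, even, fillvalue=_MISSING):
        for item in pair:
            if item is not _MISSING:
                yield item


order = list(interleave(['o1', 'o3', 'o5'], ['e2', 'e4']))
```

With real documents, each yielded page would be appended to the merged page list one at a time.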

## Splitting PDFs

Split a PDF into separate files, either one page per file or by custom ranges.

```python
from pikepdf import Pdf

# Split into single-page PDFs
with Pdf.open('document.pdf') as pdf:
    for n, page in enumerate(pdf.pages):
        with Pdf.new() as dst:
            dst.pages.append(page)
            dst.save(f'page_{n+1:03d}.pdf')

# Split into chunks of N pages
def split_pdf(input_path, pages_per_file=10):
    with Pdf.open(input_path) as pdf:
        total = len(pdf.pages)
        for start in range(0, total, pages_per_file):
            end = min(start + pages_per_file, total)
            with Pdf.new() as chunk:
                chunk.pages.extend(pdf.pages[start:end])
                chunk.save(f'chunk_{start//pages_per_file + 1}.pdf')

split_pdf('large_document.pdf', pages_per_file=25)

# Extract specific page ranges
with Pdf.open('document.pdf') as pdf:
    # Extract pages 5-10 (zero-indexed: 4-9)
    with Pdf.new() as extract:
        extract.pages.extend(pdf.pages[4:10])
        extract.save('pages_5_to_10.pdf')
```
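
Custom ranges usually arrive as 1-based text such as `1-3,7`, while page access is zero-indexed. A minimal sketch of that translation follows. The function `parse_page_ranges` and its spec syntax are assumptions made for illustration, not a pikepdf feature.

```python
def parse_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Turn a 1-based spec such as '1-3,7' into zero-based page indices."""
    indices: list[int] = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition('-')
        start = int(first)
        end = int(last) if sep else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        # range() excludes its end, so 1-based "end" is already the right bound.
        indices.extend(range(start - 1, end))
    return indices
```

Each resulting index can then be used to pick a page from the source document and append it to a new one.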

## Copying Pages Between PDFs

Copy pages from one PDF to another with automatic resource management.

```python
from pikepdf import Pdf

# Copy specific pages from source to destination
with Pdf.open('source.pdf') as src, Pdf.open('dest.pdf') as dst:
    # Append pages from source
    dst.pages.extend(src.pages[0:5])  # First 5 pages

    # Insert at specific position
    dst.pages.insert(0, src.pages[0])  # Insert at beginning

    # Replace a page
    dst.pages[2] = src.pages[10]

    dst.save('combined.pdf')

# Copy page preserving internal references (emplace)
with Pdf.open('document.pdf') as pdf:
    # Use emplace to replace page content while preserving references
    # (useful when page has bookmarks or links pointing to it)
    replacement_content = pdf.pages[5]
    pdf.pages[0].emplace(replacement_content)
    pdf.save('emplaced.pdf')
```

## Working with Images

Extract, inspect, and manipulate images embedded in PDFs using the PdfImage helper class.

```python
from pikepdf import Pdf, PdfImage

# Extract all images from a PDF
with Pdf.open('document.pdf') as pdf:
    for page_num, page in enumerate(pdf.pages):
        # page.images maps resource names such as '/Im0' to image objects
        for name, raw_image in page.images.items():
            pdfimage = PdfImage(raw_image)
            # Drop the leading '/' so the name is usable in a filename
            stem = str(name).lstrip('/')

            # Get image properties
            print(f"Page {page_num + 1}, {name}:")
            print(f"  Size: {pdfimage.width}x{pdfimage.height}")
            print(f"  Color space: {pdfimage.colorspace}")
            print(f"  Bits per component: {pdfimage.bits_per_component}")

            # Extract to file (lossless when possible)
            out_path = pdfimage.extract_to(fileprefix=f'image_p{page_num + 1}_{stem}')
            print(f"  Extracted to: {out_path}")

# Extract image as Pillow Image object
with Pdf.open('document.pdf') as pdf:
    page = pdf.pages[0]
    for name, raw_image in page.images.items():
        pdfimage = PdfImage(raw_image)
        pil_image = pdfimage.as_pil_image()
        pil_image.save(f"{str(name).lstrip('/')}.png")
```
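
Image resource names are PDF names such as `/Im0`, so they begin with a slash and may contain characters that are awkward in filenames. A small sketch of building a safe file stem is shown here. The helper `image_file_stem` and its naming scheme are hypothetical.

```python
import re


def image_file_stem(page_index: int, name: str) -> str:
    """Build a filesystem-safe file stem from a page index and a PDF name."""
    # PDF names start with '/', which would otherwise read as a path separator.
    cleaned = re.sub(r'[^A-Za-z0-9._-]', '_', str(name).lstrip('/'))
    return f'page{page_index + 1:03d}_{cleaned or "image"}'
```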

## XMP Metadata

Read and write XMP metadata and DocumentInfo with automatic synchronization between the two standards.

```python
import pikepdf
from pikepdf import Pdf

# Read metadata
with Pdf.open('document.pdf') as pdf:
    with pdf.open_metadata() as meta:
        print(f"Title: {meta.get('dc:title')}")
        print(f"Author: {meta.get('dc:creator')}")
        # The DocumentInfo /Subject entry corresponds to dc:description
        print(f"Subject: {meta.get('dc:description')}")
        print(f"Keywords: {meta.get('pdf:Keywords')}")
        print(f"Producer: {meta.get('pdf:Producer')}")
        print(f"Creator tool: {meta.get('xmp:CreatorTool')}")
        print(f"Creation date: {meta.get('xmp:CreateDate')}")

        # Check PDF/A conformance
        print(f"PDF/A conformance: {meta.pdfa_status}")

# Write metadata
with Pdf.open('document.pdf') as pdf:
    with pdf.open_metadata() as meta:
        meta['dc:title'] = 'My Document Title'
        meta['dc:creator'] = ['Author Name', 'Co-Author']
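        # The DocumentInfo /Subject entry corresponds to dc:description
        meta['dc:description'] = 'Subject of the document'
        meta['pdf:Keywords'] = 'keyword1, keyword2, keyword3'
        meta['xmp:CreatorTool'] = 'My Application v1.0'
        # pdf:Producer is not assigned here: with the default
        # set_pikepdf_as_editor=True, pikepdf writes its own value on save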

    pdf.save('with_metadata.pdf')

# Remove all metadata
with Pdf.open('document.pdf') as pdf:
    # Remove XMP metadata
    if hasattr(pdf.Root, 'Metadata'):
        del pdf.Root.Metadata
    # Remove DocumentInfo (drops the /Info entry from the trailer)
    del pdf.docinfo
    pdf.save('no_metadata.pdf')
```

## Outlines (Bookmarks)

Create, read, and modify PDF bookmarks/outlines for document navigation.

```python
from pikepdf import Pdf, OutlineItem, make_page_destination

# Create outlines from scratch
with Pdf.open('document.pdf') as pdf:
    with pdf.open_outline() as outline:
        # Add top-level entries (page numbers are zero-indexed)
        outline.root.extend([
            OutlineItem('Chapter 1', 0),
            OutlineItem('Chapter 2', 10),
            OutlineItem('Chapter 3', 25),
        ])
    pdf.save('with_bookmarks.pdf')

# Create nested outlines
with Pdf.open('document.pdf') as pdf:
    with pdf.open_outline() as outline:
        chapter1 = OutlineItem('Chapter 1', 0)
        chapter1.children.extend([
            OutlineItem('Section 1.1', 1),
            OutlineItem('Section 1.2', 5),
        ])

        chapter2 = OutlineItem('Chapter 2', 10)
        chapter2.children.append(OutlineItem('Section 2.1', 11))

        outline.root.extend([chapter1, chapter2])
    pdf.save('nested_bookmarks.pdf')

# Create bookmarks during merge
from glob import glob

with Pdf.new() as merged:
    page_count = 0
    with merged.open_outline() as outline:
        for file in sorted(glob('*.pdf')):
            with Pdf.open(file) as src:
                # Add bookmark pointing to start of this document
                outline.root.append(OutlineItem(file, page_count))
                page_count += len(src.pages)
                merged.pages.extend(src.pages)
    merged.save('merged_with_toc.pdf')
```
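
When bookmarks are added during a merge, each bookmark's target is the running total of pages added so far. A sketch of that bookkeeping on plain integers follows. The helper name `start_pages` is hypothetical.

```python
from itertools import accumulate


def start_pages(page_counts: list[int]) -> list[int]:
    """Return the zero-based first page of each document after concatenation."""
    # Running totals shifted by one place: the first document starts at page 0,
    # and the final total (the overall page count) is not a start page.
    return [0, *accumulate(page_counts)][:-1]


starts = start_pages([3, 5, 2])
```

Each value is the zero-based page index that a bookmark for the corresponding document would point to.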

## File Attachments

Attach files to PDFs and extract embedded attachments.

```python
from pikepdf import Pdf, AttachedFileSpec, Name

# Attach a file to a PDF
with Pdf.open('document.pdf') as pdf:
    # Attach from file path (from_filepath reads the file itself)
    filespec = AttachedFileSpec.from_filepath(pdf, 'data.csv')
    pdf.attachments['data.csv'] = filespec

    # Attach from bytes
    json_data = b'{"key": "value"}'
    filespec = AttachedFileSpec(pdf, json_data,
                                 description='Configuration file',
                                 filename='config.json',
                                 mime_type='application/json')
    pdf.attachments['config.json'] = filespec

    # Set PDF to show attachments panel on open
    pdf.Root.PageMode = Name.UseAttachments

    pdf.save('with_attachments.pdf')

# List and extract attachments
with Pdf.open('with_attachments.pdf') as pdf:
    for name, filespec in pdf.attachments.items():
        print(f"Attachment: {name}")
        print(f"  Description: {filespec.description}")

        # Get the embedded file
        embedded_file = filespec.get_file()
        data = embedded_file.read_bytes()

        # Save to disk
        with open(f'extracted_{name}', 'wb') as f:
            f.write(data)

# Remove an attachment
with Pdf.open('with_attachments.pdf') as pdf:
    del pdf.attachments['data.csv']
    pdf.save('attachment_removed.pdf')
```
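
Attachment names come from the PDF and are not guaranteed to be plain filenames, so writing them to disk unchecked could place files outside the intended folder. The sketch below keeps only the final path component. The helper `safe_output_path` and the fallback name `attachment.bin` are assumptions made for illustration.

```python
from pathlib import Path, PurePosixPath


def safe_output_path(out_dir: str, attachment_name: str) -> Path:
    """Map an attachment name to a path that stays inside out_dir."""
    # Treat both separator styles as separators, then keep only the last part.
    base = PurePosixPath(attachment_name.replace('\\', '/')).name
    if base in ('', '.', '..'):
        base = 'attachment.bin'
    return Path(out_dir) / base
```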

## Interactive Forms

Read, fill, and manipulate PDF form fields using the high-level Form interface.

```python
from pikepdf import Pdf
from pikepdf.form import Form, DefaultAppearanceStreamGenerator

# Extract form data
with Pdf.open('form.pdf') as pdf:
    form = Form(pdf)

    data = {}
    for field_name, field in form.items():
        print(f"Field: {field_name}")
        print(f"  Type: text={field.is_text}, checkbox={field.is_checkbox}")
        print(f"  Required: {field.is_required}")
        print(f"  Label: {field.alternate_name}")

        if field.is_text:
            data[field_name] = field.value
        elif field.is_checkbox:
            data[field_name] = field.checked
        elif field.is_choice:
            data[field_name] = field.value

# Fill a form
with Pdf.open('form.pdf') as pdf:
    # Use appearance stream generator for visual rendering
    form = Form(pdf, DefaultAppearanceStreamGenerator)

    # Fill text field
    form['FirstName'].value = 'John'
    form['LastName'].value = 'Doe'
    form['Email'].value = 'john.doe@example.com'

    # Check a checkbox
    form['AgreeToTerms'].checked = True

    # Select radio button option
    if form['Gender'].is_radio_button:
        form['Gender'].options[0].select()  # Select first option

    # Select from dropdown
    if form['Country'].is_choice:
        form['Country'].value = 'United States'

    pdf.save('filled_form.pdf')

# Flatten form (convert to non-editable content)
with Pdf.open('filled_form.pdf') as pdf:
    pdf.flatten_annotations()
    pdf.save('flattened_form.pdf')
```

## PDF Objects and Dictionary Access

Access and manipulate low-level PDF objects using Pythonic dictionary and attribute notation.

```python
from pikepdf import Pdf, Name, Dictionary, Array, String

with Pdf.open('document.pdf') as pdf:
    # Access page dictionary
    page = pdf.pages[0]

    # Attribute notation for standard PDF keys
    media_box = page.MediaBox  # [0, 0, 612, 792]
    resources = page.Resources

    # Dictionary notation for arbitrary keys
    if Name.Rotate in page:
        rotation = page['/Rotate']

    # Get with default value
    rotation = page.get(Name.Rotate, 0)

    # Modify page properties
    page.Rotate = 90
    page.MediaBox = [0, 0, 842, 595]  # Landscape A4

    # Create PDF objects
    new_dict = Dictionary({
        '/Type': Name.Catalog,
        '/Pages': pdf.Root.Pages,
    })

    new_array = Array([1, 2, 3, Name.Example])

    # Access nested objects safely
    from pikepdf import NamePath
    font = page.get(NamePath.Resources.Font.F1, None)

    # Create indirect objects (required for some PDF structures)
    indirect_dict = pdf.make_indirect(Dictionary({'/Key': 'value'}))

    pdf.save('modified.pdf')
```

## Streams and Content Streams

Work with PDF stream objects including content streams that define page graphics.

```python
from pikepdf import Pdf, parse_content_stream, unparse_content_stream

with Pdf.open('document.pdf') as pdf:
    page = pdf.pages[0]

    # /Contents may be a single stream or an array of streams;
    # coalescing merges them so it can be read as one stream
    page.contents_coalesce()

    # Read raw content stream bytes
    if hasattr(page, 'Contents'):
        raw_bytes = page.Contents.read_bytes()
        print(f"Content stream size: {len(raw_bytes)} bytes")

    # Parse content stream into instructions (accepts a page or a stream)
    instructions = parse_content_stream(page)

    for operands, operator in instructions[:10]:  # First 10 instructions
        print(f"Operator: {operator}, Operands: {operands}")

    # Modify and write back
    # (Example: pretty-print by unparsing)
    new_content = unparse_content_stream(instructions)
    page.Contents = pdf.make_stream(new_content)

    # Read stream as file-like object
    from io import BytesIO
    stream_buffer = BytesIO(page.Contents.get_stream_buffer())
```
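
Once a content stream is parsed into `(operands, operator)` pairs, simple statistics are easy to compute. The sketch below counts operator frequency using stand-in tuples. The helper `operator_histogram` is hypothetical, and it assumes that calling `str()` on a real operator object yields its text (for example `q` or `Do`).

```python
from collections import Counter
from typing import Iterable, Tuple


def operator_histogram(instructions: Iterable[Tuple[object, object]]) -> Counter:
    """Count how often each operator occurs in (operands, operator) pairs."""
    return Counter(str(operator) for _operands, operator in instructions)


# Stand-in data shaped like parsed instructions; plain strings replace operators.
sample = [
    ([], 'q'),
    ([1, 0, 0, 1, 0, 0], 'cm'),
    (['/Im0'], 'Do'),
    ([], 'Q'),
    ([], 'q'),
]
histogram = operator_histogram(sample)
```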

## Overlays and Watermarks

Add overlays, underlays, and watermarks to PDF pages.

```python
from pikepdf import Pdf, Page, Rectangle

# Add page as overlay (watermark on top)
with Pdf.open('document.pdf') as pdf:
    with Pdf.open('watermark.pdf') as watermark_pdf:
        watermark = Page(watermark_pdf.pages[0])

        for page in pdf.pages:
            dest_page = Page(page)
            # Add watermark covering the page's trim box
            dest_page.add_overlay(watermark, Rectangle(dest_page.trimbox))

        pdf.save('watermarked.pdf')

# Add page as underlay (background)
with Pdf.open('document.pdf') as pdf:
    with Pdf.open('background.pdf') as bg_pdf:
        background = Page(bg_pdf.pages[0])

        for page in pdf.pages:
            dest_page = Page(page)
            dest_page.add_underlay(background, Rectangle(dest_page.trimbox))

        pdf.save('with_background.pdf')

# Create thumbnail overlay
with Pdf.open('document.pdf') as pdf:
    main_page = Page(pdf.pages[0])
    thumbnail_source = Page(pdf.pages[1])

    # Position thumbnail in bottom-right corner
    thumbnail_rect = Rectangle(400, 50, 550, 200)
    main_page.add_overlay(thumbnail_source, thumbnail_rect)

    pdf.save('with_thumbnail.pdf')

# N-up: combine multiple pages on one
with Pdf.open('slides.pdf') as pdf:
    with Pdf.new() as output:
        # 2-up layout
        for i in range(0, len(pdf.pages), 2):
            output.add_blank_page(page_size=(842, 595))  # A4 landscape
            dest = Page(output.pages[-1])

            left_rect = Rectangle(0, 0, 421, 595)
            right_rect = Rectangle(421, 0, 842, 595)

            dest.add_overlay(Page(pdf.pages[i]), left_rect)
            if i + 1 < len(pdf.pages):
                dest.add_overlay(Page(pdf.pages[i + 1]), right_rect)

        output.save('2up_slides.pdf')
```
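
The 2-up layout generalizes to any grid by dividing the sheet into equal cells. The sketch below computes the cell corners as plain tuples. The function `grid_rectangles` is hypothetical. It assumes PDF coordinates with the origin at the bottom-left and a reading order that starts at the top-left cell.

```python
def grid_rectangles(width: float, height: float, cols: int, rows: int):
    """Return (llx, lly, urx, ury) cells, left to right, top row first."""
    cell_w, cell_h = width / cols, height / rows
    cells = []
    for row in range(rows):
        # The origin is at the bottom-left, so the top row has the largest y.
        lly = height - (row + 1) * cell_h
        for col in range(cols):
            llx = col * cell_w
            cells.append((llx, lly, llx + cell_w, lly + cell_h))
    return cells
```

Each tuple has the same four values, in the same order, as the `Rectangle(...)` calls in the 2-up example.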

## qpdf Job API

Access qpdf's full command-line capabilities programmatically using the Job interface.

```python
from pikepdf import Job

# Check a PDF for errors
job = Job(['pikepdf', '--check', 'document.pdf'])
job.run()

# Use JSON job specification
job_spec = {
    'inputFile': 'input.pdf',
    'outputFile': 'output.pdf',
    'linearize': '',
    'objectStreams': 'generate',
}
Job(job_spec).run()

# Decrypt a PDF via Job
job = Job([
    'pikepdf',
    '--password=secret',
    '--decrypt',
    'encrypted.pdf',
    'decrypted.pdf'
])
job.run()

# Recompress streams (these options compress stream data; they do not re-encode images)
job_spec = {
    'inputFile': 'input.pdf',
    'outputFile': 'optimized.pdf',
    'compressStreams': 'y',
    'recompressFlate': '',
}
Job(job_spec).run()
```

## Error Handling

Handle common pikepdf exceptions for robust PDF processing.

```python
import pikepdf
from pikepdf import Pdf, PasswordError, PdfError, DataDecodingError

def safe_open_pdf(filepath, password=None):
    """Safely open a PDF with proper error handling."""
    try:
        # An empty password is what pikepdf uses by default for unencrypted files
        return Pdf.open(filepath, password=password or '')
    except PasswordError:
        print(f"Password missing or incorrect for encrypted PDF: {filepath}")
        return None
    except PdfError as e:
        print(f"Invalid or corrupted PDF: {filepath}")
        print(f"Error: {e}")
        return None
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return None

# Handle decoding errors when reading streams
def safe_extract_image(pdfimage):
    """Safely extract an image from PDF."""
    try:
        return pdfimage.as_pil_image()
    except DataDecodingError as e:
        print(f"Could not decode image: {e}")
        return None
    except pikepdf.UnsupportedImageTypeError as e:
        print(f"Unsupported image type: {e}")
        return None

# Process multiple PDFs with error recovery
from pathlib import Path

def batch_process(input_dir, output_dir):
    """Process all PDFs in a directory."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

    for pdf_file in input_path.glob('*.pdf'):
        try:
            with Pdf.open(pdf_file) as pdf:
                # Process PDF...
                pdf.save(output_path / pdf_file.name)
                print(f"Processed: {pdf_file.name}")
        except Exception as e:
            print(f"Failed to process {pdf_file.name}: {e}")
            continue
```
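
For larger batches it often helps to collect failures rather than only printing them, so they can be retried or reported afterwards. This is a minimal sketch of that pattern, independent of any PDF library. The function `run_batch` and the shape of its return value are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Failure = Tuple[str, str]  # (item, error description)


def run_batch(
    items: Iterable[str], worker: Callable[[str], None]
) -> Tuple[List[str], List[Failure]]:
    """Apply worker to each item, recording failures instead of stopping."""
    succeeded: List[str] = []
    failed: List[Failure] = []
    for item in items:
        try:
            worker(item)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            failed.append((item, f'{type(exc).__name__}: {exc}'))
        else:
            succeeded.append(item)
    return succeeded, failed
```

The `worker` callable would wrap whatever per-file processing is needed, such as opening and saving a PDF.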

## Summary

pikepdf suits Python developers who need reliable, low-level PDF manipulation. Its primary use cases include document assembly (merging, splitting, and rearranging PDFs), bulk PDF processing pipelines, metadata management, working with encrypted documents, image extraction, form filling, and producing linearized, web-ready PDFs. Because it builds on qpdf, it can repair many damaged PDFs automatically, which makes it particularly valuable when processing files from diverse sources.

For integration, pikepdf follows familiar Python patterns: context managers for resource cleanup, list-like access to pages, and dictionary-style access to PDF objects. It works with Pillow for image operations, BytesIO for in-memory processing, and pathlib for file handling. For parallel batch operations, give each thread or process its own `Pdf` instance rather than sharing one. Combined with libraries such as reportlab (for PDF generation) or pdfminer.six (for text extraction), pikepdf covers the manipulation side of a broader PDF processing toolkit.