pikepdf
https://github.com/pikepdf/pikepdf
A Python library for reading and writing PDF, powered by QPDF
Context Summary (auto-generated)
# pikepdf

pikepdf is a Python library for reading, writing, repairing, and transforming PDF files. Built on top of the mature C++ [qpdf](https://github.com/qpdf/qpdf) library, pikepdf provides a Pythonic interface for PDF manipulation with automatic repair of damaged PDFs, full encryption support (AES-256, AES-128, RC4), XMP metadata editing, linearization for web optimization, and lossless image extraction. Unlike pure Python PDF libraries, pikepdf leverages qpdf's battle-tested capabilities to handle even malformed or corrupted PDF files.

The library excels at document assembly tasks (merge, split, rotate, rearrange pages), low-level PDF object manipulation, metadata management, and working with encrypted PDFs. It provides dictionary-style access to PDF objects that mirrors the PDF specification, making it well suited to developers who need precise control over PDF internals. pikepdf is used by projects like OCRmyPDF and PDF Arranger, and is available as pre-built binary wheels for Linux, macOS, and Windows on both x86-64 and ARM64 architectures.

## Opening and Reading PDFs

Open PDF files for reading and inspection, with automatic repair of structural damage. The `Pdf.open()` method accepts file paths, `pathlib.Path` objects, or file-like streams.
```python
from io import BytesIO
from pathlib import Path

from pikepdf import Pdf

# Open a PDF file using a context manager (recommended)
with Pdf.open('document.pdf') as pdf:
    print(f"Number of pages: {len(pdf.pages)}")
    print(f"PDF version: {pdf.pdf_version}")
    print(f"Is encrypted: {pdf.is_encrypted}")

    # Access the document catalog (root object)
    print(f"Root keys: {list(pdf.Root.keys())}")

# Open an encrypted PDF with a password
with Pdf.open('protected.pdf', password='secretpassword') as pdf:
    print(f"Owner password matched: {pdf.owner_password_matched}")
    print(f"User password matched: {pdf.user_password_matched}")

# Open from bytes via a file-like object
pdf_bytes = Path('document.pdf').read_bytes()
with Pdf.open(BytesIO(pdf_bytes)) as pdf:
    print(f"Pages: {len(pdf.pages)}")
```

## Creating New PDFs

Create new PDF documents from scratch using `Pdf.new()`. While pikepdf is primarily designed for manipulating existing PDFs, it can create simple documents.

```python
from pikepdf import Pdf

# Create a new empty PDF
pdf = Pdf.new()

# Add blank pages; sizes are given in points
pdf.add_blank_page(page_size=(612, 792))  # US Letter
pdf.add_blank_page(page_size=(595, 842))  # A4

# Save the new PDF
pdf.save('blank_document.pdf')
pdf.close()

# Using a context manager
with Pdf.new() as pdf:
    pdf.add_blank_page()  # default page size
    pdf.save('new_blank.pdf')
```

## Saving PDFs

Save modified PDFs with various options including encryption, linearization, and compression settings.
```python
from io import BytesIO

from pikepdf import Pdf, Encryption, Permissions, ObjectStreamMode

with Pdf.open('input.pdf') as pdf:
    # Basic save
    pdf.save('output.pdf')

    # Save with linearization (fast web view)
    pdf.save('web_optimized.pdf', linearize=True)

    # Save with object stream compression for smaller files
    pdf.save('compressed.pdf', object_stream_mode=ObjectStreamMode.generate)

    # Save to a file-like object
    buffer = BytesIO()
    pdf.save(buffer)
    pdf_bytes = buffer.getvalue()

# Save with encryption
with Pdf.open('input.pdf') as pdf:
    # AES-256 encryption (default, strongest)
    pdf.save('encrypted.pdf', encryption=Encryption(
        user='userpassword',
        owner='ownerpassword'
    ))

    # Encryption with restricted permissions
    no_print = Permissions(print_lowres=False, print_highres=False)
    pdf.save('no_print.pdf', encryption=Encryption(
        user='',  # Empty user password allows anyone to open
        owner='adminpass',
        allow=no_print
    ))

# Remove encryption (if user password is empty or known)
with Pdf.open('encrypted.pdf', password='userpassword') as pdf:
    pdf.save('decrypted.pdf', encryption=False)
```

## Page Manipulation

Access and manipulate PDF pages using list-like operations. Pages are zero-indexed and support slicing, insertion, deletion, and reordering.
```python
from pikepdf import Pdf

# Access pages
with Pdf.open('document.pdf') as pdf:
    # Get page count
    num_pages = len(pdf.pages)

    # Access individual pages (zero-indexed)
    first_page = pdf.pages[0]
    last_page = pdf.pages[-1]

    # Access using counting numbers (1-indexed)
    first_page = pdf.pages.p(1)  # Same as pdf.pages[0]

    # Get page properties
    page = pdf.pages[0]
    print(f"MediaBox: {page.MediaBox}")
    print(f"TrimBox: {page.trimbox}")
    print(f"Page label: {page.label}")

    # Rotate pages
    page.rotate(90, relative=True)    # Rotate 90 degrees clockwise
    page.rotate(180, relative=False)  # Set absolute rotation to 180

    # Delete pages (indices refer to the page list as it is at that moment)
    del pdf.pages[0]    # Delete first page
    del pdf.pages[-1]   # Delete last page
    del pdf.pages[1:3]  # Delete what are now the second and third pages

    # Reverse page order
    pdf.pages.reverse()

    pdf.save('modified.pdf')
```

## Merging PDFs

Combine multiple PDF documents into a single file by extending the pages collection.

```python
from glob import glob

from pikepdf import Pdf

# Basic merge
with Pdf.new() as merged:
    for filename in ['first.pdf', 'second.pdf', 'third.pdf']:
        with Pdf.open(filename) as src:
            merged.pages.extend(src.pages)
    merged.save('merged.pdf')

# Advanced merge with version tracking
with Pdf.new() as merged:
    version = merged.pdf_version
    for file in sorted(glob('*.pdf')):
        with Pdf.open(file) as src:
            version = max(version, src.pdf_version)
            merged.pages.extend(src.pages)

    # Clean up unreferenced resources
    merged.remove_unreferenced_resources()

    # Save with minimum PDF version from all sources
    merged.save('merged.pdf', min_version=version)

# Interleave pages from two PDFs (odd/even merge)
with Pdf.open('odd_pages.pdf') as odd, Pdf.open('even_pages.pdf') as even:
    with Pdf.new() as merged:
        for i in range(max(len(odd.pages), len(even.pages))):
            if i < len(odd.pages):
                merged.pages.append(odd.pages[i])
            if i < len(even.pages):
                merged.pages.append(even.pages[i])
        merged.save('interleaved.pdf')
```

## Splitting PDFs

Split a PDF into separate files, either one page per file or by custom
ranges.

```python
from pikepdf import Pdf

# Split into single-page PDFs
with Pdf.open('document.pdf') as pdf:
    for n, page in enumerate(pdf.pages):
        with Pdf.new() as dst:
            dst.pages.append(page)
            dst.save(f'page_{n+1:03d}.pdf')

# Split into chunks of N pages
def split_pdf(input_path, pages_per_file=10):
    with Pdf.open(input_path) as pdf:
        total = len(pdf.pages)
        for start in range(0, total, pages_per_file):
            end = min(start + pages_per_file, total)
            with Pdf.new() as chunk:
                chunk.pages.extend(pdf.pages[start:end])
                chunk.save(f'chunk_{start//pages_per_file + 1}.pdf')

split_pdf('large_document.pdf', pages_per_file=25)

# Extract specific page ranges
with Pdf.open('document.pdf') as pdf:
    # Extract pages 5-10 (zero-indexed: 4-9)
    with Pdf.new() as extract:
        extract.pages.extend(pdf.pages[4:10])
        extract.save('pages_5_to_10.pdf')
```

## Copying Pages Between PDFs

Copy pages from one PDF to another with automatic resource management.

```python
from pikepdf import Pdf

# Copy specific pages from source to destination
with Pdf.open('source.pdf') as src, Pdf.open('dest.pdf') as dst:
    # Append pages from source
    dst.pages.extend(src.pages[0:5])  # First 5 pages

    # Insert at specific position
    dst.pages.insert(0, src.pages[0])  # Insert at beginning

    # Replace a page
    dst.pages[2] = src.pages[10]

    dst.save('combined.pdf')

# Copy page preserving internal references (emplace)
with Pdf.open('document.pdf') as pdf:
    # Use emplace to replace page content while preserving references
    # (useful when page has bookmarks or links pointing to it)
    replacement_content = pdf.pages[5]
    pdf.pages[0].emplace(replacement_content)
    pdf.save('emplaced.pdf')
```

## Working with Images

Extract, inspect, and manipulate images embedded in PDFs using the `PdfImage` helper class.
```python
from pikepdf import Pdf, PdfImage, Name

# Extract all images from a PDF
with Pdf.open('document.pdf') as pdf:
    for page_num, page in enumerate(pdf.pages):
        if Name.XObject not in page.Resources:
            continue

        for name, raw_image in page.Resources.XObject.items():
            # /Type is optional on XObjects, so test only /Subtype
            if raw_image.get(Name.Subtype) == Name.Image:
                pdfimage = PdfImage(raw_image)

                # Get image properties
                print(f"Page {page_num + 1}, {name}:")
                print(f"  Size: {pdfimage.width}x{pdfimage.height}")
                print(f"  Color space: {pdfimage.colorspace}")
                print(f"  Bits per component: {pdfimage.bits_per_component}")

                # Keys look like '/Im0'; strip the slash so it is not
                # treated as a path separator in the output filename
                key = name.lstrip('/')

                # Extract to file (lossless when possible)
                out_path = pdfimage.extract_to(fileprefix=f'image_p{page_num}_{key}')
                print(f"  Extracted to: {out_path}")

# Extract image as Pillow Image object
with Pdf.open('document.pdf') as pdf:
    page = pdf.pages[0]
    # page.images maps names to the page's image XObjects
    for name, raw_image in page.images.items():
        pdfimage = PdfImage(raw_image)
        pil_image = pdfimage.as_pil_image()
        pil_image.save(f"{name.lstrip('/')}.png")
```

## XMP Metadata

Read and write XMP metadata and DocumentInfo with automatic synchronization between the two standards.
```python
from pikepdf import Pdf

# Read metadata
with Pdf.open('document.pdf') as pdf:
    with pdf.open_metadata() as meta:
        print(f"Title: {meta.get('dc:title')}")
        print(f"Author: {meta.get('dc:creator')}")
        print(f"Subject: {meta.get('dc:subject')}")
        print(f"Keywords: {meta.get('pdf:Keywords')}")
        print(f"Producer: {meta.get('pdf:Producer')}")
        print(f"Creator tool: {meta.get('xmp:CreatorTool')}")
        print(f"Creation date: {meta.get('xmp:CreateDate')}")

        # Check PDF/A conformance
        print(f"PDF/A conformance: {meta.pdfa_status}")

# Write metadata
with Pdf.open('document.pdf') as pdf:
    with pdf.open_metadata() as meta:
        meta['dc:title'] = 'My Document Title'
        meta['dc:creator'] = ['Author Name', 'Co-Author']
        meta['dc:subject'] = 'Subject of the document'
        meta['pdf:Keywords'] = 'keyword1, keyword2, keyword3'
        meta['xmp:CreatorTool'] = 'My Application v1.0'
        # By default pikepdf records itself as the producer when the block
        # exits; pass set_pikepdf_as_editor=False to open_metadata() to
        # keep a custom value
        meta['pdf:Producer'] = 'pikepdf'
    pdf.save('with_metadata.pdf')

# Remove all metadata
with Pdf.open('document.pdf') as pdf:
    # Remove XMP metadata
    if '/Metadata' in pdf.Root:
        del pdf.Root.Metadata

    # Remove DocumentInfo
    del pdf.docinfo

    pdf.save('no_metadata.pdf')
```

## Outlines (Bookmarks)

Create, read, and modify PDF bookmarks/outlines for document navigation.
```python
from glob import glob

from pikepdf import Pdf, OutlineItem

# Create outlines from scratch
with Pdf.open('document.pdf') as pdf:
    with pdf.open_outline() as outline:
        # Add top-level entries (page numbers are zero-indexed)
        outline.root.extend([
            OutlineItem('Chapter 1', 0),
            OutlineItem('Chapter 2', 10),
            OutlineItem('Chapter 3', 25),
        ])
    pdf.save('with_bookmarks.pdf')

# Create nested outlines
with Pdf.open('document.pdf') as pdf:
    with pdf.open_outline() as outline:
        chapter1 = OutlineItem('Chapter 1', 0)
        chapter1.children.extend([
            OutlineItem('Section 1.1', 1),
            OutlineItem('Section 1.2', 5),
        ])

        chapter2 = OutlineItem('Chapter 2', 10)
        chapter2.children.append(OutlineItem('Section 2.1', 11))

        outline.root.extend([chapter1, chapter2])
    pdf.save('nested_bookmarks.pdf')

# Create bookmarks during merge
with Pdf.new() as merged:
    page_count = 0
    with merged.open_outline() as outline:
        for file in sorted(glob('*.pdf')):
            with Pdf.open(file) as src:
                # Add bookmark pointing to start of this document
                outline.root.append(OutlineItem(file, page_count))
                page_count += len(src.pages)
                merged.pages.extend(src.pages)
    merged.save('merged_with_toc.pdf')
```

## File Attachments

Attach files to PDFs and extract embedded attachments.
```python
from pathlib import Path

from pikepdf import Pdf, AttachedFileSpec, Name

# Attach a file to a PDF
with Pdf.open('document.pdf') as pdf:
    # Attach from file path (pikepdf reads the file itself)
    filespec = AttachedFileSpec.from_filepath(pdf, 'data.csv')
    pdf.attachments['data.csv'] = filespec

    # Attach from bytes
    json_data = b'{"key": "value"}'
    filespec = AttachedFileSpec(pdf, json_data,
                                description='Configuration file',
                                filename='config.json',
                                mime_type='application/json')
    pdf.attachments['config.json'] = filespec

    # Set PDF to show attachments panel on open
    pdf.Root.PageMode = Name.UseAttachments

    pdf.save('with_attachments.pdf')

# List and extract attachments
with Pdf.open('with_attachments.pdf') as pdf:
    for name, filespec in pdf.attachments.items():
        print(f"Attachment: {name}")
        print(f"  Description: {filespec.description}")

        # Get the embedded file
        embedded_file = filespec.get_file()
        data = embedded_file.read_bytes()

        # Save to disk, keeping only the base name in case the
        # attachment name contains directory components
        with open(f'extracted_{Path(name).name}', 'wb') as f:
            f.write(data)

# Remove an attachment
with Pdf.open('with_attachments.pdf') as pdf:
    del pdf.attachments['data.csv']
    pdf.save('attachment_removed.pdf')
```

## Interactive Forms

Read, fill, and manipulate PDF form fields using the high-level `Form` interface.
```python
from pikepdf import Pdf
from pikepdf.form import Form, DefaultAppearanceStreamGenerator

# Extract form data
with Pdf.open('form.pdf') as pdf:
    form = Form(pdf)
    data = {}
    for field_name, field in form.items():
        print(f"Field: {field_name}")
        print(f"  Type: text={field.is_text}, checkbox={field.is_checkbox}")
        print(f"  Required: {field.is_required}")
        print(f"  Label: {field.alternate_name}")

        if field.is_text:
            data[field_name] = field.value
        elif field.is_checkbox:
            data[field_name] = field.checked
        elif field.is_choice:
            data[field_name] = field.value

# Fill a form
with Pdf.open('form.pdf') as pdf:
    # Use appearance stream generator for visual rendering
    form = Form(pdf, DefaultAppearanceStreamGenerator)

    # Fill text field
    form['FirstName'].value = 'John'
    form['LastName'].value = 'Doe'
    form['Email'].value = 'john.doe@example.com'

    # Check a checkbox
    form['AgreeToTerms'].checked = True

    # Select radio button option
    if form['Gender'].is_radio_button:
        form['Gender'].options[0].select()  # Select first option

    # Select from dropdown
    if form['Country'].is_choice:
        form['Country'].value = 'United States'

    pdf.save('filled_form.pdf')

# Flatten form (convert to non-editable content)
with Pdf.open('filled_form.pdf') as pdf:
    pdf.flatten_annotations()
    pdf.save('flattened_form.pdf')
```

## PDF Objects and Dictionary Access

Access and manipulate low-level PDF objects using Pythonic dictionary and attribute notation.
```python
from pikepdf import Pdf, Name, NamePath, Dictionary, Array

with Pdf.open('document.pdf') as pdf:
    # Access page dictionary
    page = pdf.pages[0]

    # Attribute notation for standard PDF keys
    media_box = page.MediaBox  # [0, 0, 612, 792]
    resources = page.Resources

    # Dictionary notation for arbitrary keys
    if Name.Rotate in page:
        rotation = page['/Rotate']

    # Get with default value
    rotation = page.get(Name.Rotate, 0)

    # Modify page properties
    page.Rotate = 90
    page.MediaBox = [0, 0, 842, 595]  # Landscape A4

    # Create PDF objects
    new_dict = Dictionary({
        '/Type': Name.Catalog,
        '/Pages': pdf.Root.Pages,
    })
    new_array = Array([1, 2, 3, Name.Example])

    # Access nested objects safely
    font = page.get(NamePath.Resources.Font.F1, None)

    # Create indirect objects (required for some PDF structures)
    indirect_dict = pdf.make_indirect(Dictionary({'/Key': 'value'}))

    pdf.save('modified.pdf')
```

## Streams and Content Streams

Work with PDF stream objects including content streams that define page graphics.

```python
from io import BytesIO

from pikepdf import Pdf, parse_content_stream, unparse_content_stream

with Pdf.open('document.pdf') as pdf:
    page = pdf.pages[0]

    if hasattr(page, 'Contents'):
        # /Contents may be an array of streams; merge them into one first
        page.contents_coalesce()

        # Read raw content stream bytes
        raw_bytes = page.Contents.read_bytes()
        print(f"Content stream size: {len(raw_bytes)} bytes")

        # Read stream as file-like object
        stream_buffer = BytesIO(page.Contents.get_stream_buffer())

    # Parse content stream into instructions
    instructions = parse_content_stream(page)
    for operands, operator in instructions[:10]:  # First 10 instructions
        print(f"Operator: {operator}, Operands: {operands}")

    # Modify and write back
    # (Example: pretty-print by unparsing)
    new_content = unparse_content_stream(instructions)
    page.Contents = pdf.make_stream(new_content)
```

## Overlays and Watermarks

Add overlays, underlays, and watermarks to PDF pages.
```python
from pikepdf import Pdf, Rectangle

# Items of pdf.pages are already Page objects, so no wrapping is needed

# Add page as overlay (watermark on top)
with Pdf.open('document.pdf') as pdf:
    with Pdf.open('watermark.pdf') as watermark_pdf:
        watermark = watermark_pdf.pages[0]

        for page in pdf.pages:
            # Add watermark covering entire page
            page.add_overlay(watermark, page.trimbox)

        pdf.save('watermarked.pdf')

# Add page as underlay (background)
with Pdf.open('document.pdf') as pdf:
    with Pdf.open('background.pdf') as bg_pdf:
        background = bg_pdf.pages[0]

        for page in pdf.pages:
            page.add_underlay(background, page.trimbox)

        pdf.save('with_background.pdf')

# Create thumbnail overlay
with Pdf.open('document.pdf') as pdf:
    main_page = pdf.pages[0]
    thumbnail_source = pdf.pages[1]

    # Position thumbnail in bottom-right corner
    thumbnail_rect = Rectangle(400, 50, 550, 200)
    main_page.add_overlay(thumbnail_source, thumbnail_rect)

    pdf.save('with_thumbnail.pdf')

# N-up: combine multiple pages on one
with Pdf.open('slides.pdf') as pdf:
    with Pdf.new() as output:
        # 2-up layout
        for i in range(0, len(pdf.pages), 2):
            output.add_blank_page(page_size=(842, 595))  # A4 landscape
            dest = output.pages[-1]

            left_rect = Rectangle(0, 0, 421, 595)
            right_rect = Rectangle(421, 0, 842, 595)

            dest.add_overlay(pdf.pages[i], left_rect)
            if i + 1 < len(pdf.pages):
                dest.add_overlay(pdf.pages[i + 1], right_rect)

        output.save('2up_slides.pdf')
```

## qpdf Job API

Access qpdf's full command-line capabilities programmatically using the `Job` interface.
```python
from pikepdf import Job

# Check a PDF for errors
job = Job(['pikepdf', '--check', 'document.pdf'])
job.run()

# Use JSON job specification
job_spec = {
    'inputFile': 'input.pdf',
    'outputFile': 'output.pdf',
    'linearize': '',
    'objectStreams': 'generate',
}
Job(job_spec).run()

# Decrypt a PDF via Job
job = Job([
    'pikepdf',
    '--password=secret',
    '--decrypt',
    'encrypted.pdf',
    'decrypted.pdf'
])
job.run()

# Recompress streams, including those already Flate-compressed
job_spec = {
    'inputFile': 'input.pdf',
    'outputFile': 'optimized.pdf',
    'compressStreams': 'y',
    'recompressFlate': '',
}
Job(job_spec).run()
```

## Error Handling

Handle common pikepdf exceptions for robust PDF processing.

```python
from pathlib import Path

import pikepdf
from pikepdf import Pdf, PasswordError, PdfError, DataDecodingError

def safe_open_pdf(filepath, password=None):
    """Safely open a PDF with proper error handling."""
    try:
        if password:
            return Pdf.open(filepath, password=password)
        return Pdf.open(filepath)
    except PasswordError:
        # Raised for a missing password as well as a wrong one;
        # must be caught before PdfError, its parent class
        print(f"PDF is encrypted and the password is missing or wrong: {filepath}")
        return None
    except PdfError as e:
        print(f"Invalid or corrupted PDF: {filepath}")
        print(f"Error: {e}")
        return None
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return None

# Handle decoding errors when reading streams
def safe_extract_image(pdfimage):
    """Safely extract an image from PDF."""
    try:
        return pdfimage.as_pil_image()
    except DataDecodingError as e:
        print(f"Could not decode image: {e}")
        return None
    except pikepdf.UnsupportedImageTypeError as e:
        print(f"Unsupported image type: {e}")
        return None

# Process multiple PDFs with error recovery
def batch_process(input_dir, output_dir):
    """Process all PDFs in a directory."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    for pdf_file in input_path.glob('*.pdf'):
        try:
            with Pdf.open(pdf_file) as pdf:
                # Process PDF...
                pdf.save(output_path / pdf_file.name)
            print(f"Processed: {pdf_file.name}")
        except Exception as e:
            # Report the failure and move on to the next file
            print(f"Failed to process {pdf_file.name}: {e}")
```

## Summary

pikepdf is well suited to Python developers who need reliable, low-level PDF manipulation. Its primary use cases include document assembly (merging, splitting, and rearranging PDFs), bulk PDF processing pipelines, metadata management, working with encrypted documents, image extraction, form filling, and creating optimized web-ready PDFs. The library's ability to automatically repair damaged PDFs and its comprehensive support for the PDF specification make it particularly valuable for processing PDFs from diverse sources.

For integration, pikepdf follows familiar Python patterns: context managers for resource cleanup, list-like access to pages, and dictionary-style access to PDF objects. It works with Pillow for image operations, `BytesIO` for in-memory processing, and `pathlib` for file handling. Reading is safe across threads when each thread uses its own `Pdf` instance, and batch jobs can be parallelized with multiprocessing as long as each worker process opens its own files rather than receiving `Pdf` objects from another process. When combined with other libraries like reportlab (for PDF generation) or pdfminer.six (for text extraction), pikepdf forms a comprehensive PDF processing toolkit that covers the manipulation tasks those libraries do not.