### Install pdf-diff from PyPI

Source: https://github.com/joshdata/pdf-diff/blob/primary/README.md

Use pip to install the pdf-diff package. This is the recommended method for most users.

```bash
pip install pdf-diff
```

--------------------------------

### Install pdf-diff and System Dependencies

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Installs the system dependencies on Ubuntu (apt-get) or macOS (Homebrew), then the pdf-diff Python package.

```bash
sudo apt-get install python3-lxml poppler-utils
```

```bash
brew install libxml2 libxslt poppler
```

```bash
pip install pdf-diff
```

--------------------------------

### Install pdf-diff from Source

Source: https://github.com/joshdata/pdf-diff/blob/primary/README.md

Install the pdf-diff package directly from its source code. This is useful for development or when a specific version is needed.

```bash
sudo python3 setup.py install
```

--------------------------------

### Deploy pdf-diff

Source: https://github.com/joshdata/pdf-diff/blob/primary/README.md

Commands to prepare and upload a new release of the pdf-diff package using setuptools, wheel, and twine.

```bash
python3 -m pip install --user --upgrade setuptools wheel twine
python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*
```

--------------------------------

### Compare PDFs and Output PNG to Stdout

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Basic command-line usage: compare two PDF files and write the resulting comparison image to standard output.

```bash
pdf-diff before.pdf after.pdf > comparison.png
```

--------------------------------

### Run pdf-diff to Compare PDFs

Source: https://github.com/joshdata/pdf-diff/blob/primary/README.md

Execute the pdf-diff script to compare two PDF files and output the differences as a PNG image. The output is redirected to a file.

```bash
pdf-diff before.pdf after.pdf > comparison_output.png
```

--------------------------------

### Command Line Interface - pdf-diff

Source: https://context7.com/joshdata/pdf-diff/llms.txt

The main entry point for comparing two PDF documents and generating a visual diff output as a PNG image. Various options are available for customization.

```APIDOC
## Command Line Interface - pdf-diff

### Description
The main entry point for comparing two PDF documents and generating a visual diff output as a PNG image.

### Method
CLI

### Endpoint
`pdf-diff before.pdf after.pdf`

### Parameters
#### Command Line Arguments
- `--format` (string) - Optional - Output image format (e.g., gif, jpeg, ppm, tiff). Default is PNG.
- `--style` (string) - Optional - Diff marking styles (box, strike, underline). Format: `deletion_style,addition_style`. Default is `strike,underline`.
- `--top-margin` (integer) - Optional - Ignore content above this position, given as a percentage of page height measured from the top (for headers). Default is 0.
- `--bottom-margin` (integer) - Optional - Ignore content below this position, given as a percentage of page height measured from the top (for footers); 95 ignores the bottom 5%. Default is 100.
- `--result-width` (integer) - Optional - Output image width in pixels. Default is 900.
- `--changes` - Optional - Render from pre-computed changes JSON read from stdin instead of comparing two files.

### Request Example
```bash
# Basic usage - compare two PDFs and output PNG to stdout
pdf-diff before.pdf after.pdf > comparison.png

# Save as different format (gif)
pdf-diff before.pdf after.pdf --format gif > comparison.gif

# Customize diff marking styles (box for deletions, underline for additions)
pdf-diff before.pdf after.pdf --style box,underline > comparison.png

# Ignore headers/footers with margin settings (percentage of page height)
pdf-diff before.pdf after.pdf --top-margin 5 --bottom-margin 95 > comparison.png

# Adjust output image width (default: 900px)
pdf-diff before.pdf after.pdf --result-width 1200 > comparison.png

# Render from pre-computed changes JSON (read from stdin)
cat changes.json | pdf-diff --changes > comparison.png
```

### Response
Outputs a PNG image (or other specified format) to stdout representing the comparison.
```

--------------------------------

### render_changes(changes, styles, width)

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Takes a list of change objects and renders them into a side-by-side comparison image with visual annotations. Returns a PIL Image object that can be saved in various formats.

```APIDOC
## render_changes(changes, styles, width)

### Description
Takes a list of change objects and renders them into a side-by-side comparison image with visual annotations. Returns a PIL Image object that can be saved in various formats.

### Method
Python Function

### Parameters
- **changes** (list) - Required - A list of change objects, typically generated by `compute_changes`.
- **styles** (list of strings) - Required - A list of two strings specifying the style for deletions and additions, respectively. Available styles: "box", "strike", "underline". Example: `["strike", "underline"]`.
- **width** (integer) - Required - The desired width of the output image in pixels.

### Request Example
```python
from pdf_diff.command_line import compute_changes, render_changes

# Compute differences
changes = compute_changes("old.pdf", "new.pdf")

# Render with default styles (strike for deletions, underline for additions)
styles = ["strike", "underline"]
img = render_changes(changes, styles, width=900)

# Save as PNG
img.save("comparison.png", "PNG")

# Render with box style for both
styles = ["box", "box"]
img = render_changes(changes, styles, width=1200)
img.save("comparison_boxes.png", "PNG")
```

### Response
- **img** (PIL.Image.Image) - A PIL Image object representing the side-by-side comparison. This object can be saved to a file using its `save()` method.
```

--------------------------------

### Render from Pre-computed Changes JSON

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Generates a comparison image from a JSON file containing pre-computed text changes, read from standard input.

```bash
cat changes.json | pdf-diff --changes > comparison.png
```

--------------------------------

### Save Comparison as Different Image Format

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDFs and saves the output image in a specified format like GIF.

```bash
pdf-diff before.pdf after.pdf --format gif > comparison.gif
```

--------------------------------

### Adjust Output Image Width

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDFs and sets a custom width for the generated comparison image. The default width is 900px.

```bash
pdf-diff before.pdf after.pdf --result-width 1200 > comparison.png
```

--------------------------------

### Render Differences into a Comparison Image

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Renders a list of change objects into a side-by-side comparison image using specified styles. Returns a PIL Image object.

```python
from pdf_diff.command_line import compute_changes, render_changes

# Compute differences
changes = compute_changes("old.pdf", "new.pdf")

# Render with default styles (strike for deletions, underline for additions)
styles = ["strike", "underline"]
img = render_changes(changes, styles, width=900)

# Save as PNG
img.save("comparison.png", "PNG")
```

```python
from pdf_diff.command_line import compute_changes, render_changes

# Compute differences first so this snippet runs on its own
changes = compute_changes("old.pdf", "new.pdf")

# Render with box style for both
styles = ["box", "box"]
img = render_changes(changes, styles, width=1200)
img.save("comparison_boxes.png", "PNG")
```

--------------------------------

### Customize Diff Marking Styles

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDFs and applies custom styles for marking deletions (first style) and additions (second style).

```bash
pdf-diff before.pdf after.pdf --style box,underline > comparison.png
```

--------------------------------

### compute_changes(pdf_fn_1, pdf_fn_2, top_margin=0, bottom_margin=100)

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDF files and returns a list of change objects representing text differences between them. Each change object contains bounding box coordinates, page information, and the changed text content.

```APIDOC
## compute_changes(pdf_fn_1, pdf_fn_2, top_margin=0, bottom_margin=100)

### Description
Compares two PDF files and returns a list of change objects representing text differences between them. Each change object contains bounding box coordinates, page information, and the changed text content.

### Method
Python Function

### Parameters
- **pdf_fn_1** (string) - Required - Path to the first PDF file.
- **pdf_fn_2** (string) - Required - Path to the second PDF file.
- **top_margin** (integer) - Optional - Ignore content above this position, given as a percentage of page height measured from the top (for headers). Default is 0.
- **bottom_margin** (integer) - Optional - Ignore content below this position, given as a percentage of page height measured from the top (for footers); 95 ignores the bottom 5%. Default is 100.

### Request Example
```python
from pdf_diff.command_line import compute_changes
import json

# Compare two PDF documents
changes = compute_changes("document_v1.pdf", "document_v2.pdf")

# Ignore top 5% and bottom 5% of each page (headers/footers)
changes = compute_changes(
    "document_v1.pdf",
    "document_v2.pdf",
    top_margin=5,
    bottom_margin=95
)

# Output changes as JSON
print(json.dumps(changes, indent=2, default=str))
```

### Response
- **changes** (list) - A list of change objects. Each object represents a text difference and includes details like page number, coordinates, and the text content. Returns an empty list if no changes are found.

#### Response Example
```json
[
  {
    "index": 0,
    "pdf": {"index": 0, "file": "document_v1.pdf"},
    "page": {"number": 1, "width": 612.0, "height": 792.0},
    "x": 72.0,
    "y": 100.5,
    "width": 150.0,
    "height": 12.0,
    "text": "deleted text ",
    "startIndex": 0,
    "textLength": 13
  },
  "*",  # Alignment marker between change groups
  ...
]
```
```

--------------------------------

### Rasterize PDF Page to Image

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Uses the pdftopng function to convert a specific PDF page into a PIL Image object. Specify the PDF file, page number, and desired width for rasterization. The returned image is in RGBA mode.

```python
from pdf_diff.command_line import pdftopng

# Rasterize page 1 at 900px width
img = pdftopng("document.pdf", 1, 900)
img.save("page1.png", "PNG")

# Rasterize page 3 at higher resolution
img = pdftopng("document.pdf", 3, 1800)
img.save("page3_highres.png", "PNG")

# The returned image is in RGBA mode
print(f"Image size: {img.size}, Mode: {img.mode}")
# Output: Image size: (900, 1165), Mode: RGBA
```

--------------------------------

### Ignore Headers/Footers with Margin Settings

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDFs while ignoring specified top and bottom margins, expressed as a percentage of page height.

```bash
pdf-diff before.pdf after.pdf --top-margin 5 --bottom-margin 95 > comparison.png
```

--------------------------------

### Compute Changes Between Two PDF Documents

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Compares two PDF files programmatically and returns a list of change objects. Requires importing `compute_changes`.

```python
from pdf_diff.command_line import compute_changes
import json

# Compare two PDF documents
changes = compute_changes("document_v1.pdf", "document_v2.pdf")
```

```python
from pdf_diff.command_line import compute_changes
import json

# Ignore top 5% and bottom 5% of each page (headers/footers)
changes = compute_changes(
    "document_v1.pdf",
    "document_v2.pdf",
    top_margin=5,
    bottom_margin=95
)
```

```python
from pdf_diff.command_line import compute_changes
import json

# Compute the changes first so this snippet runs on its own
changes = compute_changes("document_v1.pdf", "document_v2.pdf")

# Output changes as JSON
print(json.dumps(changes, indent=2, default=str))
```

--------------------------------

### pdf_to_bboxes(pdf_index, fn, top_margin=0, bottom_margin=100)

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Generator function that extracts text bounding boxes from a PDF file using pdftotext. Yields dictionaries containing position, dimensions, and text content for each word.

```APIDOC
## pdf_to_bboxes(pdf_index, fn, top_margin=0, bottom_margin=100)

### Description
Generator function that extracts text bounding boxes from a PDF file using pdftotext. Yields dictionaries containing position, dimensions, and text content for each word.

### Method
Python Function

### Parameters
- **pdf_index** (integer) - Required - An index for the PDF file (used internally).
- **fn** (string) - Required - Path to the PDF file.
- **top_margin** (integer) - Optional - Ignore content above this position, given as a percentage of page height measured from the top. Default is 0.
- **bottom_margin** (integer) - Optional - Ignore content below this position, given as a percentage of page height measured from the top; 90 ignores the bottom 10%. Default is 100.

### Request Example
```python
from pdf_diff.command_line import pdf_to_bboxes

# Extract all text bounding boxes from a PDF
for bbox in pdf_to_bboxes(0, "document.pdf"):
    print(f"Page {bbox['page']['number']}: '{bbox['text']}' at ({bbox['x']}, {bbox['y']})")

# With margin filtering (ignore top 10% and bottom 10%)
for bbox in pdf_to_bboxes(0, "document.pdf", top_margin=10, bottom_margin=90):
    print(f"Text: {bbox['text']}, Width: {bbox['width']}, Height: {bbox['height']}")
```

### Response
- **bbox** (dict) - A dictionary representing a text bounding box. Each dictionary contains:
  - `index` (integer): Sequential box index.
  - `pdf` (dict): Information about the PDF file (`index`, `file`).
  - `page` (dict): Information about the page (`number`, `width`, `height`).
  - `x` (float): Left edge coordinate (PDF coordinates).
  - `y` (float): Top edge coordinate (PDF coordinates).
  - `width` (float): Width of the bounding box.
  - `height` (float): Height of the bounding box.
  - `text` (string): The extracted text content within the bounding box.

#### Response Example
```json
{
  "index": 0,
  "pdf": {"index": 0, "file": "document.pdf"},
  "page": {"number": 1, "width": 612.0, "height": 792.0},
  "x": 72.0,
  "y": 720.5,
  "width": 45.2,
  "height": 11.0,
  "text": "Hello"
}
```
```

--------------------------------

### Function Flow Diagram

Source: https://github.com/joshdata/pdf-diff/blob/primary/README.md

A diagram illustrating the flow of operations within the pdf-diff script, from computing changes to rendering and stacking pages.

```text
compute_changes
│
├── serialize_pdf (called twice)
│   ├── pdf_to_bboxes
│   ├── mark_eol_hyphens
│   │   └── mark_eol_hyphen
│   └── Processes bounding boxes and text
│
├── perform_diff
│   └── Calls external `fast_diff_match_patch`
│
└── process_hunks
    ├── Iterates through diff hunks
    └── mark_difference (called multiple times)

render_changes
│
├── simplify_changes
├── make_pages_images
│   └── pdftopng (converts PDF pages to images)
├── realign_pages
│   ├── Splits pages into sub-pages
│   └── Adjusts box coordinates
├── draw_red_boxes
│   └── Annotates images with rectangles or lines
└── zealous_crop
    └── Crops the image to reduce unnecessary margins

stack_pages
│
└── Combines processed images into a final output
```

--------------------------------

### Extract Text Bounding Boxes from PDF

Source: https://context7.com/joshdata/pdf-diff/llms.txt

Generator function to extract text bounding boxes from a PDF file using pdftotext. Yields dictionaries with position, dimensions, and text.

```python
from pdf_diff.command_line import pdf_to_bboxes

# Extract all text bounding boxes from a PDF
for bbox in pdf_to_bboxes(0, "document.pdf"):
    print(f"Page {bbox['page']['number']}: '{bbox['text']}' at ({bbox['x']}, {bbox['y']})")
```

```python
from pdf_diff.command_line import pdf_to_bboxes

# With margin filtering (ignore top 10% and bottom 10%)
for bbox in pdf_to_bboxes(0, "document.pdf", top_margin=10, bottom_margin=90):
    print(f"Text: {bbox['text']}, Width: {bbox['width']}, Height: {bbox['height']}")
```
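--------------------------------

### Group Extracted Words by Page (illustrative sketch)

The bounding-box dictionaries described for `pdf_to_bboxes` can be post-processed with plain Python. The sketch below groups boxes by `page.number` and joins their `text` fields. `words_by_page` is a hypothetical helper, not part of pdf-diff, and the sample data is made up to follow the documented response shape; in real use the input would presumably be the generator returned by `pdf_to_bboxes(0, "document.pdf")`.

```python
from collections import defaultdict


def words_by_page(bboxes):
    """Group bounding-box dicts by page number and join their text per page.

    Hypothetical helper (not part of pdf-diff). It assumes each dict has the
    documented `page.number` and `text` keys.
    """
    pages = defaultdict(list)
    for bbox in bboxes:
        text = bbox["text"].strip()
        if text:  # skip boxes that contain only whitespace
            pages[bbox["page"]["number"]].append(text)
    # Return pages in ascending order, words joined by single spaces
    return {number: " ".join(words) for number, words in sorted(pages.items())}


# Made-up sample shaped like the documented response example.
sample = [
    {"index": 0, "page": {"number": 1, "width": 612.0, "height": 792.0},
     "x": 72.0, "y": 720.5, "width": 45.2, "height": 11.0, "text": "Hello"},
    {"index": 1, "page": {"number": 1, "width": 612.0, "height": 792.0},
     "x": 120.0, "y": 720.5, "width": 40.0, "height": 11.0, "text": "world "},
    {"index": 2, "page": {"number": 2, "width": 612.0, "height": 792.0},
     "x": 72.0, "y": 720.5, "width": 50.0, "height": 11.0, "text": "Second"},
    {"index": 3, "page": {"number": 2, "width": 612.0, "height": 792.0},
     "x": 125.0, "y": 720.5, "width": 30.0, "height": 11.0, "text": "page"},
]

print(words_by_page(sample))
# {1: 'Hello world', 2: 'Second page'}
```

Stripping each word before joining is a design choice for this sketch, since the documented examples show text both with and without trailing whitespace.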