### Initialize pdfplumber and Check Version

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/ag-energy-roundup-curves.ipynb

This snippet imports the pdfplumber library and prints its version, a basic setup step before using the library.

```python
import pdfplumber

print(pdfplumber.__version__)
```

--------------------------------

### Install pdfplumber using pip

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Installs the pdfplumber library using pip, the Python package installer. This is the standard method for adding the library to your Python environment.

```shell
pip install pdfplumber
```

--------------------------------

### Command Line Interface Extraction

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Provides examples of using the pdfplumber CLI to extract PDF content into formats such as CSV and JSON, and to inspect the document structure tree.

```bash
pdfplumber document.pdf > output.csv
pdfplumber document.pdf --format json > output.json
pdfplumber document.pdf --pages 1 3-5 10
pdfplumber document.pdf --structure --indent 2
```

--------------------------------

### GET /structure

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Retrieves the logical structure tree of the PDF, including headings, paragraphs, and lists.

```APIDOC
## GET /structure

### Description
Retrieves the semantic structure tree of the PDF document or a specific page if available.

### Method
GET

### Endpoint
/structure

### Parameters
#### Query Parameters
- **page_number** (int) - Optional - The specific page to retrieve the structure for.

### Response
#### Success Response (200)
- **structure** (object) - The hierarchical structure tree of the document.

#### Response Example
{
  "type": "Document",
  "children": [
    { "type": "H1", "text": "Title" },
    { "type": "P", "text": "Paragraph content" }
  ]
}
```

--------------------------------

### Search Text in PDF with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates how to search for text in a PDF page using literal strings or regular expressions. It includes examples of case-insensitive searches, regex group extraction, and accessing character-level metadata.

```python
import pdfplumber
import re

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Search for literal string
    results = page.search("invoice", regex=False, case=False)
    for match in results:
        print(f"Found '{match['text']}' at ({match['x0']}, {match['top']})")

    # Search with regex pattern
    results = page.search(r"\$[\d,]+\.\d{2}")  # Match currency amounts
    for match in results:
        print(f"Amount found: {match['text']}")
        print(f"Bounding box: ({match['x0']}, {match['top']}, {match['x1']}, {match['bottom']})")

    # Search with compiled regex and get regex groups
    pattern = re.compile(r"(\w+)@(\w+\.\w+)")  # Simplified email pattern
    results = page.search(
        pattern,
        return_groups=True,
        return_chars=True,
        main_group=0
    )
    for match in results:
        print(f"Email: {match['text']}")
        print(f"Groups: {match['groups']}")
        print(f"Character objects: {len(match['chars'])} chars")

    # Case-insensitive search
    results = page.search("TOTAL", case=False)
```

--------------------------------

### GET .extract_words

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Returns a list of all word-looking objects along with their bounding boxes and optional attributes.

```APIDOC
## GET .extract_words

### Description
Identifies sequences of characters as words based on x/y tolerance and returns their bounding boxes. Supports advanced features like ligature expansion and character attribute grouping.

### Method
GET

### Endpoint
.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, ...)

### Parameters
#### Query Parameters
- **x_tolerance** (int) - Optional - Horizontal threshold for word grouping.
- **y_tolerance** (int) - Optional - Vertical threshold for word grouping.
- **extra_attrs** (list) - Optional - List of character attributes to group by.

### Response
#### Success Response (200)
- **words** (list) - A list of dictionaries containing word text, bounding boxes, and optional attributes.
```

--------------------------------

### Specify Ghostscript path for PDF repair in pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

When repairing PDFs, you can explicitly provide the path to the Ghostscript executable using the `gs_path` argument. This is helpful if pdfplumber cannot automatically locate your Ghostscript installation. This parameter can be used with any of the repair methods.

```python
import pdfplumber

# Example using repair and saving to file with custom gs_path
pdfplumber.repair("malformed.pdf", outfile="repaired.pdf", gs_path="/usr/local/bin/gs")

# Example using open with repair and custom gs_path
# with pdfplumber.open("malformed.pdf", repair=True, gs_path="/usr/local/bin/gs") as pdf:
#     print(pdf.pages[0].extract_text())
```

--------------------------------

### Repair PDF and get bytes with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This function repairs a PDF file and returns its content as a BytesIO object. This is useful when you need to process the repaired PDF content in memory without saving it to a new file. It takes the path to the malformed PDF as input.

```python
import pdfplumber
from io import BytesIO

repaired_pdf_bytes: BytesIO = pdfplumber.repair("malformed.pdf")

# You can now use repaired_pdf_bytes, for example, to open it again with pdfplumber
# with pdfplumber.open(repaired_pdf_bytes) as pdf:
#     print(pdf.pages[0].extract_text())
```

--------------------------------

### Initialize pdfplumber and Load PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-nics.ipynb

Initializes the pdfplumber library and opens a target PDF file for processing.

```python
import pdfplumber

pdf = pdfplumber.open("../pdfs/background-checks.pdf")
```

--------------------------------

### Working with Transformations and Coordinates

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Explains how to access each character's current transformation matrix (CTM) and page-level coordinate properties such as MediaBox, CropBox, and doctop values.

```python
import pdfplumber
from pdfplumber.ctm import CTM

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Access each character's current transformation matrix (CTM)
    for char in page.chars[:3]:
        if "matrix" in char:
            ctm = CTM(*char["matrix"])
            print(f"Character: '{char['text']}'")
            print(f"  Position: ({char['x0']}, {char['top']})")
            print(f"  Rotation/skew: {ctm.skew_x}")

    # Page coordinate properties
    print(f"MediaBox: {page.mediabox}")
    print(f"CropBox: {page.cropbox}")
    print(f"BBox: {page.bbox}")
    print(f"Rotation: {page.rotation}")

    # Document-level coordinates (doctop spans all pages)
    for char in page.chars[:5]:
        print(f"'{char['text']}': top={char['top']}, doctop={char['doctop']}")
```

--------------------------------

### GET .extract_text

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Collates all of the page's character objects into a single string, with options for layout preservation.

```APIDOC
## GET .extract_text

### Description
Collates all of the page's character objects into a single string.
When layout=False, it uses tolerance parameters to insert spaces and newlines. When layout=True, it attempts to mimic the visual structure of the page.

### Method
GET

### Endpoint
.extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, line_dir_render=None, char_dir_render=None, **kwargs)

### Parameters
#### Query Parameters
- **x_tolerance** (int) - Optional - Horizontal distance threshold for spacing.
- **y_tolerance** (int) - Optional - Vertical distance threshold for newlines.
- **layout** (boolean) - Optional - Whether to attempt to preserve visual layout.

### Response
#### Success Response (200)
- **text** (string) - The extracted text content from the page.
```

--------------------------------

### Initialize pdfplumber and Open PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Imports the pdfplumber library and opens a target PDF file for processing. This is the foundational step for any extraction task.

```python
import pdfplumber

print(pdfplumber.__version__)
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")
```

--------------------------------

### Open and Load PDFs with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates various ways to open PDF files using pdfplumber.open(), including basic usage, password-protected files, custom layout analysis parameters, Unicode normalization, opening specific pages, and loading from a BytesIO stream. It shows how to access page count, metadata, and page dimensions.

```python
import pdfplumber
from io import BytesIO

# Basic usage - open a PDF file
with pdfplumber.open("document.pdf") as pdf:
    print(f"Number of pages: {len(pdf.pages)}")
    print(f"Metadata: {pdf.metadata}")

    # Access first page
    first_page = pdf.pages[0]
    print(f"Page dimensions: {first_page.width} x {first_page.height}")

# Open password-protected PDF
with pdfplumber.open("protected.pdf", password="secret123") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

# Open with layout analysis parameters for higher-level text objects
with pdfplumber.open("document.pdf", laparams={"line_overlap": 0.7}) as pdf:
    page = pdf.pages[0]
    # Access textboxhorizontal objects when laparams is set
    textboxes = page.textboxhorizontals
    for box in textboxes:
        print(box["text"])

# Open with Unicode normalization
with pdfplumber.open("document.pdf", unicode_norm="NFC") as pdf:
    text = pdf.pages[0].extract_text()

# Open specific pages only (1-indexed)
with pdfplumber.open("large_document.pdf", pages=[1, 5, 10]) as pdf:
    for page in pdf.pages:
        print(f"Page {page.page_number}: {page.extract_text()[:100]}")

# Open from BytesIO stream
with open("document.pdf", "rb") as f:
    stream = BytesIO(f.read())

with pdfplumber.open(stream) as pdf:
    print(pdf.pages[0].extract_text())
```

--------------------------------

### GET /page/extract_text

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb

Extracts all text content from a specific PDF page object, allowing for configuration of whitespace handling.

```APIDOC
## GET /page/extract_text

### Description
Extracts the text from a PDF page object. This method processes the page content line by line and can be configured to preserve or strip whitespace characters.

### Method
GET

### Endpoint
Page.extract_text()

### Parameters
#### Query Parameters
- **keep_blank_chars** (boolean) - Optional - If set to True, retains all whitespace characters as literal characters in the output.

### Request Example
```python
text = page.extract_text(keep_blank_chars=True)
```

### Response
#### Success Response (200)
- **text** (string) - The full text content extracted from the PDF page.

#### Response Example
```text
For:1094N Page 1 SAN JOSE POLICE DEPT Date Report Run : Tue, May-24-16 FIREARM SEARCH
... (rest of the extracted text)
```
```

--------------------------------

### Using PDFStructTree for Visual Debugging

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to use the `PDFStructTree` class for advanced analysis, including plotting bounding boxes of specific element types.

```APIDOC
## Using PDFStructTree for Visual Debugging

### Description
This example uses the `PDFStructTree` class to perform more advanced operations, such as finding all elements of a specific type (e.g., 'TD') and drawing their bounding boxes on a page image.

### Method
```python
import pdfplumber
from pdfplumber.structure import PDFStructTree

# Assuming 'pdffile' is the path to your PDF file
with pdfplumber.open(pdffile) as pdf:
    page = pdf.pages[0]  # Get the first page
    stree = PDFStructTree(pdf, page)  # Initialize PDFStructTree for the page
    img = page.to_image()  # Convert page to an image object

    # Find all 'TD' elements and draw their bounding boxes
    td_elements = list(stree.find_all("TD"))
    img.draw_rects(stree.element_bbox(td) for td in td_elements)

    # To save or display the image:
    # img.save("page_with_td_bboxes.png")
    # img.show()
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Generates an image with bounding boxes drawn)

### Notes
- `PDFStructTree(pdf, page)` initializes the structure tree analysis for a given page.
- `stree.find_all(element_name)` searches for elements by name, regex, or function, similar to BeautifulSoup.
- `stree.element_bbox(element)` returns the bounding box of a given structure element.
```

--------------------------------

### pdfplumber Python: Loading a PDF with options

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Illustrates various ways to load a PDF file using pdfplumber, from a file path or from an open binary file object. It also shows how to handle password-protected PDFs, configure layout analysis parameters, pre-normalize Unicode text, and enable strict metadata parsing.

```python
import pdfplumber

# Load from file path
with pdfplumber.open("path/to/file.pdf") as pdf:
    pass

# Load from file object (bytes)
with open("path/to/file.pdf", "rb") as f:
    with pdfplumber.open(f) as pdf:
        pass

# Load password-protected PDF
with pdfplumber.open("file.pdf", password="test") as pdf:
    pass

# Set layout analysis parameters
with pdfplumber.open("file.pdf", laparams={"line_overlap": 0.7}) as pdf:
    pass

# Pre-normalize Unicode text
with pdfplumber.open("file.pdf", unicode_norm="NFC") as pdf:
    pass

# Strict metadata parsing
with pdfplumber.open("file.pdf", strict_metadata=True) as pdf:
    pass
```

--------------------------------

### Visual Debugging with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates how to render PDF pages as images and overlay annotations such as rectangles, lines, and table detection results. This is essential for troubleshooting extraction logic and verifying object coordinates.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image(resolution=150)
    im.draw_rects(page.chars, stroke=(255, 0, 0), fill=(255, 0, 0, 50))
    im.draw_lines(page.lines, stroke=(0, 0, 255), stroke_width=2)
    im.debug_tablefinder({"vertical_strategy": "lines", "horizontal_strategy": "lines"})
    im.save("debug_output.png", format="PNG")
    im.show()
```

--------------------------------

### Access and Visualize PDF Pages

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Demonstrates how to retrieve a specific page object from the PDF and convert it into an image for visual inspection.

```python
p0 = pdf.pages[0]
im = p0.to_image()
im
```

--------------------------------

### Load PDF Page with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/ag-energy-roundup-curves.ipynb

Demonstrates loading a specific page from a PDF file using pdfplumber. It opens a PDF and selects the first page for further processing.

```python
report = pdfplumber.open("../pdfs/ag-energy-round-up-2017-02-24.pdf").pages[0]
```

--------------------------------

### Python: Open and Read PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

The Python library provides a context manager to open and interact with PDF files programmatically.

```APIDOC
## POST /pdfplumber/open

### Description
Opens a PDF file for reading and returns a PDF object instance.

### Method
Python Method: pdfplumber.open(path, **kwargs)

### Parameters
#### Request Body
- **path** (string) - Required - Path to the PDF file or file-like object.
- **password** (string) - Optional - Password for protected PDFs.
- **laparams** (dict) - Optional - Layout analysis parameters for pdfminer.six.
- **unicode_norm** (string) - Optional - Unicode normalization form (NFC, NFD, NFKC, NFKD).

### Request Example
import pdfplumber

with pdfplumber.open("file.pdf", password="secret") as pdf:
    page = pdf.pages[0]

### Response
#### Success Response (200)
- **pdf** (object) - An instance of the pdfplumber.PDF class.
```

--------------------------------

### Accessing PDF Logical Structure Tree

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to retrieve and traverse the logical structure tree of a PDF, enabling the identification of semantic elements like headings and paragraphs.

```python
import pdfplumber
import json

with pdfplumber.open("structured_document.pdf") as pdf:
    # Get structure tree for entire document
    structure = pdf.structure_tree
    if structure:
        print(json.dumps(structure, indent=2))
    else:
        print("No structure tree found")

    # Get structure tree for specific page
    page = pdf.pages[0]
    page_structure = page.structure_tree

    # Navigate structure tree
    def print_structure(elements, indent=0):
        for elem in elements:
            print(" " * indent + f"{elem.get('type', 'unknown')}")
            if 'children' in elem:
                print_structure(elem['children'], indent + 1)

    print_structure(structure)
```

--------------------------------

### Working with Element Attributes (BBox)

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Illustrates how to access and convert bounding box (BBox) attributes from PDF coordinate space to pdfplumber's coordinate space.

```APIDOC
## Working with Element Attributes (BBox)

### Description
This example shows how to extract and convert the `BBox` attribute from a structure element, such as a `Table`, `Figure`, or `Image`, into pdfplumber's coordinate system.

### Method
```python
# Assuming 'element' is a structure tree element with a 'BBox' attribute
# Assuming 'page' is a pdfplumber Page object
x0, y0, x1, y1 = element['attributes']['BBox']
top = page.height - y1
bottom = page.height - y0
doctop = page.initial_doctop + top
bbox = (x0, top, x1, bottom)

print(f"Original BBox: ({x0}, {y0}, {x1}, {y1})")
print(f"Converted BBox in pdfplumber space: {bbox}")
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Prints to console)

### Notes
- PDF coordinate space has the origin at the bottom-left.
- `page.height` and `page.initial_doctop` are used for conversion.
- The `BBox` attribute provides the bounding box coordinates.
```

--------------------------------

### Parse and Display Raw Data (Python)

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb

This snippet displays the first two entries of data parsed from a PDF report. It assumes the data has already been parsed into a list of dictionaries named `parsed`.

```python
parsed[:2]
```

--------------------------------

### Calculate Character Rotation using CTM

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates how to instantiate a CTM object from a character's matrix property to calculate rotation or skew. This is useful for determining the orientation of text within a PDF.

```python
from pdfplumber.ctm import CTM

# Assumes `pdf` is an already opened pdfplumber PDF object
my_char = pdf.pages[0].chars[3]
my_char_ctm = CTM(*my_char["matrix"])
my_char_rotation = my_char_ctm.skew_x
```

--------------------------------

### Import Pandas and Create DataFrame

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Imports the pandas library and creates a DataFrame from extracted table data. Assumes 'table' is a pre-existing list of lists whose first row holds the column headers.

```python
import pandas as pd

df = pd.DataFrame(table[1:], columns=table[0])
```

--------------------------------

### Extract Text from PDF Pages with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Illustrates different methods for extracting text from PDF pages using pdfplumber. Includes basic extraction, layout-preserving extraction, customized extraction with tolerances and density parameters, simple/fast extraction, and extracting individual words with bounding boxes and character details. Also shows how to extract text lines.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Basic text extraction
    text = page.extract_text()
    print(text)

    # Extract text with layout preservation (mimics visual structure)
    text_with_layout = page.extract_text(layout=True)
    print(text_with_layout)

    # Customized text extraction with tolerances
    text = page.extract_text(
        x_tolerance=3,    # Horizontal spacing tolerance
        y_tolerance=3,    # Vertical spacing tolerance
        layout=True,
        x_density=7.25,   # Characters per point (horizontal)
        y_density=13      # Lines per point (vertical)
    )

    # Simple/fast text extraction
    simple_text = page.extract_text_simple(x_tolerance=3, y_tolerance=3)

    # Extract words with bounding boxes
    words = page.extract_words(
        x_tolerance=3,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=False,
        extra_attrs=["fontname", "size"],  # Include font info per word
        split_at_punctuation=True,
        expand_ligatures=True,
        return_chars=True  # Include individual char objects
    )
    for word in words[:5]:
        print(f"Word: '{word['text']}' at ({word['x0']}, {word['top']}) - ({word['x1']}, {word['bottom']})")
        if 'fontname' in word:
            print(f"  Font: {word['fontname']}, Size: {word['size']}")

    # Extract text lines with character details
    text_lines = page.extract_text_lines(
        layout=True,
        strip=True,
        return_chars=True
    )
    for line in text_lines[:3]:
        print(f"Line: '{line['text']}' at top={line['top']}")
```

--------------------------------

### Extract Tables from PDF using pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates various methods for extracting tables from PDF pages using pdfplumber. It covers default extraction, extracting the largest table, finding table objects with metadata, and using custom settings for tables without visible lines, explicit lines, or strict line detection. It also shows how to debug the table finder.

```python
import pdfplumber

with pdfplumber.open("document_with_tables.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all tables from page (returns list of 2D arrays)
    tables = page.extract_tables()
    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        for row in table:
            print(row)

    # Extract the largest table
    table = page.extract_table()
    if table:
        headers = table[0]
        for row in table[1:]:
            print(dict(zip(headers, row)))

    # Find table objects (with metadata like cells, bbox)
    table_objects = page.find_tables()
    for tbl in table_objects:
        print(f"Table bbox: {tbl.bbox}")
        print(f"Cells: {len(tbl.cells)}")
        print(f"Rows: {len(tbl.rows)}")
        # Extract table data
        data = tbl.extract()
        print(data)

    # Custom table settings for tables without visible lines
    table_settings = {
        "vertical_strategy": "text",  # Use text alignment
        "horizontal_strategy": "text",
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
        "snap_tolerance": 3,
        "join_tolerance": 3,
        "edge_min_length": 3,
        "intersection_tolerance": 3,
        "text_x_tolerance": 3,
        "text_y_tolerance": 3,
    }
    tables = page.extract_tables(table_settings)

    # Use explicit lines strategy
    table_settings = {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": [50, 150, 300, 450],     # x-coordinates
        "explicit_horizontal_lines": [100, 200, 300, 400],  # y-coordinates
    }
    tables = page.extract_tables(table_settings)

    # Use lines_strict (only actual lines, not rectangle edges)
    table_settings = {
        "vertical_strategy": "lines_strict",
        "horizontal_strategy": "lines_strict",
    }
    tables = page.extract_tables(table_settings)

    # Debug table finder to understand detection
    finder = page.debug_tablefinder(table_settings)
    print(f"Edges found: {len(finder.edges)}")
    print(f"Intersections: {len(finder.intersections)}")
    print(f"Cells: {len(finder.cells)}")
    print(f"Tables: {len(finder.tables)}")
```

--------------------------------

### Sort Data by Handgun Checks in Python

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-nics.ipynb

Sorts the processed data (`data`) in descending order by the 'handgun' count and prints the top 6 entries, formatted for readability. This helps identify the states with the most handgun-only background checks.

```python
for row in list(reversed(sorted(data, key=lambda x: x["handgun"])))[:6]:
    print("{state}: {handgun:,d} handgun-only checks".format(**row))
```

--------------------------------

### Crop and Filter PDF Pages with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Illustrates how to manipulate PDF pages using pdfplumber. This includes cropping pages to specific regions using bounding boxes, filtering objects based on custom criteria (like size or color), removing duplicate characters, and chaining these operations for complex page processing.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Crop to specific bounding box (x0, top, x1, bottom)
    cropped = page.crop((0, 0, page.width / 2, page.height / 2))
    text = cropped.extract_text()
    print(f"Text from top-left quarter: {text[:100]}")

    # Crop with relative coordinates (offset from page origin)
    cropped = page.crop((50, 100, 300, 400), relative=True)

    # Get only objects fully within a bounding box
    within = page.within_bbox((100, 100, 400, 400))
    print(f"Objects fully within box: {len(within.chars)} chars")

    # Get objects outside a bounding box
    outside = page.outside_bbox((100, 100, 400, 400))
    print(f"Objects outside box: {len(outside.chars)} chars")

    # Filter objects with custom function
    def is_large_text(obj):
        return obj.get("size", 0) > 12

    filtered = page.filter(is_large_text)
    large_text = filtered.extract_text()
    print(f"Large text only: {large_text}")

    # Filter by color
    def is_red_text(obj):
        color = obj.get("non_stroking_color", (0, 0, 0))
        if isinstance(color, tuple) and len(color) >= 3:
            return color[0] > 0.5 and color[1] < 0.3 and color[2] < 0.3
        return False

    red_text_page = page.filter(is_red_text)
    print(f"Red text: {red_text_page.extract_text()}")

    # Remove duplicate characters
    deduped = page.dedupe_chars(tolerance=1, extra_attrs=("fontname", "size"))
    print(f"Original chars: {len(page.chars)}, Deduped: {len(deduped.chars)}")

    # Chain operations
    result = (
        page
        .crop((50, 50, 500, 700))
        .filter(lambda obj: obj.get("size", 0) > 10)
        .dedupe_chars()
    )
    text = result.extract_text()
```

--------------------------------

### pdfplumber.open()

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Opens a PDF file or stream and returns a PDF object for accessing pages and metadata.

```APIDOC
## GET /pdfplumber/open

### Description
Opens a PDF document from a file path or stream. Supports password protection, layout analysis parameters, and Unicode normalization.

### Method
GET (Library Function)

### Parameters
#### Path Parameters
- **path** (string/file-like) - Required - The file path, file object, or BytesIO stream of the PDF.

#### Query Parameters
- **password** (string) - Optional - Password for protected PDFs.
- **laparams** (dict) - Optional - Layout analysis parameters for text object detection.
- **unicode_norm** (string) - Optional - Unicode normalization form (e.g., 'NFC').
- **pages** (list) - Optional - List of specific page numbers to load.

### Request Example
pdfplumber.open("document.pdf", password="secret123", laparams={"line_overlap": 0.7})

### Response
#### Success Response (200)
- **pdf** (object) - A PDF object containing metadata and a list of page objects.
```

--------------------------------

### Accessing Page Structure Tree Elements

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to iterate through the structure tree of a specific PDF page to retrieve element types and their associated marked content IDs.

```python
import pdfplumber

# Assumes 'pdffile' is the path to a tagged PDF
with pdfplumber.open(pdffile) as pdf:
    for element in pdf.pages[0].structure_tree:
        # Elements are plain dicts; empty fields such as "mcids" may be omitted
        print(element["type"], element.get("mcids"))
        for child in element.get("children", []):
            print(child["type"], child.get("mcids"))
```

--------------------------------

### Repairing Malformed PDFs

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to use the repair utility to fix corrupted PDF files using Ghostscript. It supports direct file output, stream processing, and integration with pdfplumber's open method.

```python
from pdfplumber import repair
import pdfplumber

# Repair and save to file
repair("malformed.pdf", outfile="repaired.pdf")

# Open with automatic repair
with pdfplumber.open("malformed.pdf", repair=True) as pdf:
    text = pdf.pages[0].extract_text()
```

--------------------------------

### Apply Explicit Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates the 'explicit' strategy for pdfplumber table extraction, which relies solely on lines defined in 'explicit_vertical_lines' and 'explicit_horizontal_lines'. This offers the highest level of control over table structure definition.

```python
# Pair these with "explicit_vertical_lines" / "explicit_horizontal_lines",
# which supply the cell boundaries the explicit strategy relies on.
table_settings = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
}
```

--------------------------------

### Accessing Structure Tree for a Page

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to iterate through the structure tree of a specific page and access element types and marked content IDs (MCIDs).

```APIDOC
## Accessing Structure Tree for a Page

### Description
This code snippet shows how to access the structure tree for a specific page in a PDF document and iterate through its elements, printing their types and associated MCIDs.

### Method
```python
with pdfplumber.open(pdffile) as pdf:
    for element in pdf.pages[0].structure_tree:
        # Elements are plain dicts; empty fields such as "mcids" may be omitted
        print(element["type"], element.get("mcids"))
        for child in element.get("children", []):
            print(child["type"], child.get("mcids"))
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Prints to console)

### Notes
- The `type` field indicates the structure element type (e.g., 'P', 'H1', 'Table').
- The `mcids` field is a list of marked content section IDs related to the element.
- Additional fields like `lang`, `alt_text`, and `attributes` may be present.
```

--------------------------------

### pdfplumber Python: Basic PDF page access

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates the basic usage of the pdfplumber Python library to open a PDF file and access the first page's characters. It shows how to import the library and print the first character object on the page.

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
```

--------------------------------

### Visualizing Element Bounding Boxes with PDFStructTree

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Uses the PDFStructTree object to find specific elements like table cells (TD) and draw their bounding boxes on a page image.

```python
from pdfplumber.structure import PDFStructTree

# Assumes `pdf` is an already opened pdfplumber PDF object
page = pdf.pages[0]
stree = PDFStructTree(pdf, page)
img = page.to_image()
# Search the page's structure tree for TD elements and outline each one
img.draw_rects(stree.element_bbox(td) for td in stree.find_all("TD"))
```

--------------------------------

### Configure pdfplumber Table Extraction Settings

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

This snippet shows the default settings dictionary for pdfplumber's extract_tables method. These settings control how tables are identified and extracted, including strategies for vertical and horizontal separation, tolerance levels for snapping and joining lines, and text extraction parameters.

```python
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "edge_min_length_prefilter": 1,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": "…"
}
```

--------------------------------

### Table Extraction Settings API

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

This section details the `table_settings` argument for the `extract_tables` method in pdfplumber, outlining all available configuration options and their default values.

````APIDOC
## Table-extraction settings

By default, `extract_tables` uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the `table_settings` argument. The possible settings, and their defaults:

```python
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "edge_min_length_prefilter": 1,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": …,
}
```

| Setting | Description |
|---------|-------------|
|`"vertical_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"horizontal_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"explicit_vertical_lines"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `x` coordinate of a line the full height of the page) or `line`/`rect`/`curve` objects.|
|`"explicit_horizontal_lines"`| A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `y` coordinate of a line the full width of the page) or `line`/`rect`/`curve` objects.|
|`"snap_tolerance"`, `"snap_x_tolerance"`, `"snap_y_tolerance"`| Parallel lines within `snap_tolerance` points will be "snapped" to the same horizontal or vertical position.|
|`"join_tolerance"`, `"join_x_tolerance"`, `"join_y_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
|`"edge_min_length_prefilter"`| Edges shorter than `edge_min_length_prefilter` will be discarded during initial edge filtering from the page. Lowering this value (e.g., to `0.5`) can help capture small dashed lines that might otherwise be filtered out.|
|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
|`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` points to be considered intersecting.|
|`"text_*"`| All settings prefixed with `text_` are then used when extracting text from each discovered table. All possible arguments to `Page.extract_text(...)` are also valid here.|
|`"text_x_tolerance"`, `"text_y_tolerance"`| These `text_`-prefixed settings *also* apply to the table-identification algorithm when the `text` strategy is used. I.e., when that algorithm searches for words, it will expect the individual letters in each word to be no more than `text_x_tolerance`/`text_y_tolerance` points apart.|
````

--------------------------------

### Configure Text-Based Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Shows how to configure the 'text' strategy for vertical and horizontal table extraction in pdfplumber. This strategy deduces cell boundaries based on word alignment, requiring a minimum number of words ('min_words_vertical', 'min_words_horizontal') to share an alignment.

```python
# Pass this dict as the `table_settings` argument of extract_table / extract_tables.
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 5,
    "min_words_horizontal": 2,
}
```

--------------------------------

### Converting PDF BBox Coordinates to pdfplumber Space

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Shows how to transform a BBox attribute from the PDF coordinate system (origin at bottom-left) to the pdfplumber coordinate system.
```python
# PDF space measures y upward from the bottom edge of the page;
# pdfplumber measures `top` and `bottom` downward from the top edge.
x0, y0, x1, y1 = element['attributes']['BBox']
top = page.height - y1
bottom = page.height - y0
doctop = page.initial_doctop + top
bbox = (x0, top, x1, bottom)
```

--------------------------------

### Notes on Table Extraction

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Additional notes and considerations for using pdfplumber's table extraction feature, including version changes.

```APIDOC
## Notes

- Often it's helpful to crop a page with `Page.crop(bounding_box)` before trying to extract the table.
- Table extraction for `pdfplumber` was radically redesigned for `v0.5.0`, and introduced breaking changes.
```

--------------------------------

### Crop Page Before Table Extraction

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Provides a note on a common practice in pdfplumber: cropping a page using Page.crop(bounding_box) before attempting table extraction. This can help isolate specific table areas and improve extraction accuracy.

```python
# Example usage (assuming 'page' is a pdfplumber Page object)
# bounding_box = (x0, top, x1, bottom)
# cropped_page = page.crop(bounding_box)
# cropped_page.extract_tables()
```

--------------------------------

### Access PDF Objects and Attributes with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to retrieve and inspect various PDF objects such as characters, lines, rectangles, curves, images, and annotations. Each object provides detailed attributes like coordinates, styling, and color information.
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Access all objects dictionary
    print(f"Object types found: {page.objects.keys()}")

    # Access character objects
    for char in page.chars[:5]:
        print(f"Char: '{char['text']}' at ({char['x0']}, {char['top']})")
        print(f"  Font: {char['fontname']}, Size: {char['size']}")
        print(f"  Color: {char['non_stroking_color']}")

    # Access line objects
    for line in page.lines[:3]:
        print(f"Line from ({line['x0']}, {line['top']}) to ({line['x1']}, {line['bottom']})")
        print(f"  Width: {line['linewidth']}, Color: {line['stroking_color']}")

    # Access rectangle objects
    for rect in page.rects[:3]:
        print(f"Rect: ({rect['x0']}, {rect['top']}, {rect['x1']}, {rect['bottom']})")
        print(f"  Fill: {rect['non_stroking_color']}, Stroke: {rect['stroking_color']}")

    # Access curve objects
    for curve in page.curves[:2]:
        print(f"Curve with {len(curve['pts'])} points")
        print(f"  Fill: {curve['fill']}, Dash: {curve['dash']}")

    # Access image objects
    for img in page.images:
        print(f"Image at ({img['x0']}, {img['top']}), size: {img['srcsize']}")
        print(f"  Colorspace: {img['colorspace']}, Bits: {img['bits']}")

    # Access annotations and hyperlinks
    for annot in page.annots:
        print(f"Annotation: {annot['contents']}")
    for link in page.hyperlinks:
        print(f"Link: {link['uri']} at ({link['x0']}, {link['top']})")

    # Access edge objects (derived from lines, rects, curves)
    print(f"Total edges: {len(page.edges)}")
    print(f"Horizontal edges: {len(page.horizontal_edges)}")
    print(f"Vertical edges: {len(page.vertical_edges)}")
```

--------------------------------

### Visual Debugging API

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Methods for rendering PDF pages as images and drawing annotations to visualize extracted data and structure.
```APIDOC
## POST /visual-debug

### Description
Converts a PDF page into an image object to perform visual debugging, such as drawing rectangles around characters, words, or lines, and visualizing table detection.

### Method
POST

### Parameters
#### Request Body
- **resolution** (int) - Optional - DPI for the rendered image.
- **width** (int) - Optional - Width in pixels.
- **height** (int) - Optional - Height in pixels.
- **antialias** (bool) - Optional - Enable smoother rendering.

### Response
#### Success Response (200)
- **image_object** (object) - An image object with methods like draw_rects, draw_lines, and save.
```

--------------------------------

### Create a PageImage object

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Converts a PDF page into a PageImage object. Supports optional parameters like resolution and antialiasing to control output quality.

```python
im = my_pdf.pages[0].to_image(resolution=150)
```

--------------------------------

### Repair PDF on the fly with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This method repairs a malformed PDF file when opening it with pdfplumber. The repair is done in memory and the repaired version is not saved to disk. It takes the file path as input and accepts a boolean `repair` argument.

```python
import pdfplumber

with pdfplumber.open("malformed.pdf", repair=True) as pdf:
    # Process the PDF
    print(pdf.pages[0].extract_text())
```

--------------------------------

### Table Extraction Strategies

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Explanation of the different strategies available for `vertical_strategy` and `horizontal_strategy` in pdfplumber's table extraction.
```APIDOC
## Table-extraction strategies

Both `vertical_strategy` and `horizontal_strategy` accept the following options:

| Strategy | Description |
|----------|-------------|
| `"lines"` | Use the page's graphical lines (including the sides of rectangle objects) as the borders of potential table-cells. |
| `"lines_strict"` | Use the page's graphical lines (but *not* the sides of rectangle objects) as the borders of potential table-cells. |
| `"text"` | For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words. |
| `"explicit"` | Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`. |
```

--------------------------------

### Use Strict Line-Based Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Explains the 'lines_strict' strategy for pdfplumber table extraction, which uses only graphical lines as cell borders, excluding the edges of rectangle objects. This provides a more constrained approach to identifying table structures.

```python
# Pass this dict as the `table_settings` argument of extract_table / extract_tables.
table_settings = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}
```

--------------------------------

### Extract Table from PDF Page

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates how to open a PDF file and extract the largest table from the first page using pdfplumber's extract_table method.

```python
import pdfplumber

with pdfplumber.open("path/to/my.pdf") as pdf:
    page = pdf.pages[0]
    table_data = page.extract_table()
```
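
The value returned by `extract_table` is a list of rows, each row a list of cell strings (with `None` for empty cells), or `None` when no table is found. The sketch below is one possible way to pass custom settings and turn that result into dictionaries keyed by the header row. The names `STRICT_LINE_SETTINGS`, `rows_to_records`, and `extract_first_table` are hypothetical helpers for illustration and are not part of pdfplumber. The sketch also assumes the first row of the table is a header, which will not hold for every PDF.

```python
from typing import Any, Dict, List, Optional

# Hypothetical overrides; both keys come from the table-extraction settings.
STRICT_LINE_SETTINGS: Dict[str, Any] = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}


def rows_to_records(
    table_data: Optional[List[List[Optional[str]]]],
) -> List[Dict[str, str]]:
    """Treat the first row as a header and return one dict per remaining row."""
    if not table_data:
        # Covers both "no table found" (None) and an empty table.
        return []
    header = [(cell or "").strip() for cell in table_data[0]]
    records = []
    for row in table_data[1:]:
        # Empty cells arrive as None; normalize them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_first_table(path: str, table_settings: Optional[Dict[str, Any]] = None):
    """Open the PDF at `path` and extract the largest table on its first page."""
    # Imported here so rows_to_records can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        return page.extract_table(table_settings=table_settings or {})


if __name__ == "__main__":
    # Stand-in for real extract_table output, so the demo needs no PDF file.
    sample = [["Name", "Qty"], ["Widget", "2"], ["Gadget", None]]
    print(rows_to_records(sample))
```

With a real file, the call would look like `rows_to_records(extract_first_table("path/to/my.pdf", STRICT_LINE_SETTINGS))`. Tables with duplicate or blank header cells would need extra handling, since later columns overwrite earlier ones in each dict.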