### Initialize pdfplumber and Check Version

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/ag-energy-roundup-curves.ipynb

This snippet shows how to import the pdfplumber library and print its version. It's a basic setup step for using the library.

```python
import pdfplumber
print(pdfplumber.__version__)
```

--------------------------------

### Install pdfplumber using pip

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Installs the pdfplumber library using pip, the Python package installer. This is the standard method for adding the library to your Python environment.

```shell
pip install pdfplumber
```

--------------------------------

### Command Line Interface Extraction

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Provides examples of using the pdfplumber CLI to extract PDF content into various formats like CSV and JSON, and to inspect the document structure tree.

```bash
pdfplumber document.pdf > output.csv
pdfplumber document.pdf --format json > output.json
pdfplumber document.pdf --pages 1 3-5 10
pdfplumber document.pdf --structure --indent 2
```

--------------------------------

### Structure Tree (`structure_tree`)

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Retrieves the logical structure tree of the PDF, including headings, paragraphs, and lists.

```APIDOC
## structure_tree

### Description
Retrieves the semantic structure tree of the PDF document (`pdf.structure_tree`) or of a specific page (`page.structure_tree`), if the PDF contains one.

### Method
Python property (pdfplumber is a library, not an HTTP service)

### Endpoint
N/A (This is a library property)

### Parameters
N/A

### Response
#### Return Value
- **structure** (list) - The hierarchical structure tree as nested dictionaries; empty if the document has no structure tree.

#### Response Example
[
  {
    "type": "Document",
    "children": [
      { "type": "H1", "mcids": [0] },
      { "type": "P", "mcids": [1] }
    ]
  }
]
```

--------------------------------

### Search Text in PDF with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates how to search for text in a PDF page using literal strings or regular expressions. It includes examples of case-insensitive searches, regex group extraction, and accessing character-level metadata.

```python
import pdfplumber
import re

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Search for literal string
    results = page.search("invoice", regex=False, case=False)
    for match in results:
        print(f"Found '{match['text']}' at ({match['x0']}, {match['top']})")

    # Search with regex pattern
    results = page.search(r"\$[\d,]+\.\d{2}")  # Match currency amounts
    for match in results:
        print(f"Amount found: {match['text']}")
        print(f"Bounding box: ({match['x0']}, {match['top']}, {match['x1']}, {match['bottom']})")

    # Search with compiled regex and get regex groups
    pattern = re.compile(r"(\w+)@(\w+\.\w+)")  # Email pattern
    results = page.search(
        pattern,
        return_groups=True,
        return_chars=True,
        main_group=0
    )

    for match in results:
        print(f"Email: {match['text']}")
        print(f"Groups: {match['groups']}")
        print(f"Character objects: {len(match['chars'])} chars")

    # Case-insensitive search
    results = page.search("TOTAL", case=False)
```
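The match dictionaries shown above carry `x0`, `top`, `x1`, and `bottom` keys, so several matches can be merged into one bounding box, for example to pass to `page.crop`. The helper below is a minimal sketch; `union_bbox` and the sample data are hypothetical and not part of pdfplumber.

```python
def union_bbox(matches):
    """Return the smallest (x0, top, x1, bottom) box covering every match."""
    if not matches:
        return None
    return (
        min(m["x0"] for m in matches),
        min(m["top"] for m in matches),
        max(m["x1"] for m in matches),
        max(m["bottom"] for m in matches),
    )

# Hypothetical match dicts, shaped like the search results above
sample_matches = [
    {"text": "$10.00", "x0": 72.0, "top": 100.0, "x1": 110.0, "bottom": 112.0},
    {"text": "$2,500.00", "x0": 300.0, "top": 240.0, "x1": 355.0, "bottom": 252.0},
]
print(union_bbox(sample_matches))
```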

--------------------------------

### Page.extract_words

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Returns a list of all word-looking objects along with their bounding boxes and optional attributes.

```APIDOC
## Page.extract_words

### Description
Identifies sequences of characters as words based on x/y tolerance and returns their bounding boxes. Supports advanced features like ligature expansion and character attribute grouping.

### Method
Python Method: .extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, ...)

### Endpoint
N/A (This is a library method)

### Parameters
#### Keyword Arguments
- **x_tolerance** (number) - Optional - Horizontal threshold for word grouping. Defaults to 3.
- **y_tolerance** (number) - Optional - Vertical threshold for word grouping. Defaults to 3.
- **extra_attrs** (list) - Optional - List of character attributes to group by.

### Response
#### Return Value
- **words** (list) - A list of dictionaries containing word text, bounding boxes, and optional attributes.
```
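To illustrate what a vertical tolerance means in practice, the sketch below groups word dictionaries into lines when their `top` values are close together. It is a simplified illustration under stated assumptions, not pdfplumber's own algorithm; `group_words_into_lines` and the sample words are hypothetical.

```python
def group_words_into_lines(words, y_tolerance=3):
    """Cluster words whose `top` is within y_tolerance of the line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right and join the word text
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]

# Hypothetical word dicts with the keys returned by extract_words
sample_words = [
    {"text": "World", "x0": 50, "top": 10.5},
    {"text": "Hello", "x0": 10, "top": 10.0},
    {"text": "Next", "x0": 10, "top": 30.0},
]
print(group_words_into_lines(sample_words))
```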

--------------------------------

### Specify Ghostscript path for PDF repair in pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

When repairing PDFs, you can explicitly provide the path to the Ghostscript executable using the `gs_path` argument. This is helpful if pdfplumber cannot automatically locate your Ghostscript installation. This parameter can be used with any of the repair methods.

```python
import pdfplumber

# Example using repair and saving to file with custom gs_path
pdfplumber.repair("malformed.pdf", outfile="repaired.pdf", gs_path="/usr/local/bin/gs")

# Example using open with repair and custom gs_path
# with pdfplumber.open("malformed.pdf", repair=True, gs_path="/usr/local/bin/gs") as pdf:
#     print(pdf.pages[0].extract_text())
```

--------------------------------

### Repair PDF and get bytes with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This function repairs a PDF file and returns its content as a BytesIO object. This is useful when you need to process the repaired PDF content in memory without saving it to a new file. It takes the path to the malformed PDF as input.

```python
import pdfplumber
from io import BytesIO

repaired_pdf_bytes: BytesIO = pdfplumber.repair("malformed.pdf")
# You can now use repaired_pdf_bytes, for example, to open it again with pdfplumber
# with pdfplumber.open(repaired_pdf_bytes) as pdf:
#     print(pdf.pages[0].extract_text())
```

--------------------------------

### Initialize pdfplumber and Load PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-nics.ipynb

Initializes the pdfplumber library and opens a target PDF file for processing.

```python
import pdfplumber
pdf = pdfplumber.open("../pdfs/background-checks.pdf")
```

--------------------------------

### Working with Transformations and Coordinates

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Explains how to access each character's current transformation matrix (CTM) and page-level coordinate properties such as MediaBox, CropBox, and doctop values.

```python
import pdfplumber
from pdfplumber.ctm import CTM

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Access each character's current transformation matrix (CTM)
    for char in page.chars[:3]:
        if "matrix" in char:
            ctm = CTM(*char["matrix"])
            print(f"Character: '{char['text']}'")
            print(f"  Position: ({char['x0']}, {char['top']})")
            print(f"  Rotation/skew: {ctm.skew_x}")

    # Page coordinate properties
    print(f"MediaBox: {page.mediabox}")
    print(f"CropBox: {page.cropbox}")
    print(f"BBox: {page.bbox}")
    print(f"Rotation: {page.rotation}")

    # Document-level coordinates (doctop spans all pages)
    for char in page.chars[:5]:
        print(f"'{char['text']}': top={char['top']}, doctop={char['doctop']}")
```

--------------------------------

### Page.extract_text

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Collates all of the page's character objects into a single string, with options for layout preservation.

```APIDOC
## Page.extract_text

### Description
Collates all of the page's character objects into a single string. When layout=False, it uses tolerance parameters to insert spaces and newlines. When layout=True, it attempts to mimic the visual structure of the page.

### Method
Python Method: .extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, line_dir_render=None, char_dir_render=None, **kwargs)

### Endpoint
N/A (This is a library method)

### Parameters
#### Keyword Arguments
- **x_tolerance** (number) - Optional - Horizontal distance threshold for spacing. Defaults to 3.
- **y_tolerance** (number) - Optional - Vertical distance threshold for newlines. Defaults to 3.
- **layout** (boolean) - Optional - Whether to attempt to preserve visual layout. Defaults to False.

### Response
#### Return Value
- **text** (string) - The extracted text content from the page.
```

--------------------------------

### Initialize pdfplumber and Open PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Imports the pdfplumber library and opens a target PDF file for processing. This is the foundational step for any extraction task.

```python
import pdfplumber
print(pdfplumber.__version__)
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")
```

--------------------------------

### Open and Load PDFs with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates various ways to open PDF files using pdfplumber.open(), including basic usage, password-protected files, custom layout analysis parameters, Unicode normalization, opening specific pages, and loading from a BytesIO stream. It shows how to access page count, metadata, and page dimensions.

```python
import pdfplumber

# Basic usage - open a PDF file
with pdfplumber.open("document.pdf") as pdf:
    print(f"Number of pages: {len(pdf.pages)}")
    print(f"Metadata: {pdf.metadata}")

    # Access first page
    first_page = pdf.pages[0]
    print(f"Page dimensions: {first_page.width} x {first_page.height}")

# Open password-protected PDF
with pdfplumber.open("protected.pdf", password="secret123") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

# Open with layout analysis parameters for higher-level text objects
with pdfplumber.open("document.pdf", laparams={"line_overlap": 0.7}) as pdf:
    page = pdf.pages[0]
    # Access textboxhorizontal objects when laparams is set
    textboxes = page.textboxhorizontals
    for box in textboxes:
        print(box["text"])

# Open with Unicode normalization
with pdfplumber.open("document.pdf", unicode_norm="NFC") as pdf:
    text = pdf.pages[0].extract_text()

# Open specific pages only (1-indexed)
with pdfplumber.open("large_document.pdf", pages=[1, 5, 10]) as pdf:
    for page in pdf.pages:
        print(f"Page {page.page_number}: {page.extract_text()[:100]}")

# Open from BytesIO stream
from io import BytesIO
with open("document.pdf", "rb") as f:
    stream = BytesIO(f.read())
with pdfplumber.open(stream) as pdf:
    print(pdf.pages[0].extract_text())
```
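Because the `pages` argument above takes a list of 1-indexed page numbers, page ranges have to be expanded before they are passed in. The helper below is one possible sketch; `expand_pages` is a hypothetical name, not a pdfplumber function.

```python
def expand_pages(spec):
    """Turn items such as [1, "3-5", 10] into [1, 3, 4, 5, 10]."""
    pages = []
    for item in spec:
        text = str(item)
        if "-" in text:
            start, end = (int(part) for part in text.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(text))
    return pages

print(expand_pages([1, "3-5", 10]))
```

The result could then be supplied as `pdfplumber.open("large_document.pdf", pages=expand_pages([1, "3-5", 10]))`.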

--------------------------------

### Page.extract_text with keep_blank_chars

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb

Extracts all text content from a specific PDF page object, allowing for configuration of whitespace handling.

````APIDOC
## Page.extract_text

### Description
Extracts the text from a PDF page object. Keyword arguments control how whitespace characters are handled.

### Method
Python Method: Page.extract_text()

### Endpoint
N/A (This is a library method)

### Parameters
#### Keyword Arguments
- **keep_blank_chars** (boolean) - Optional - If set to True, blank characters are treated as part of a word rather than as a separator between words.

### Request Example
```python
text = page.extract_text(keep_blank_chars=True)
```

### Response
#### Return Value
- **text** (string) - The full text content extracted from the PDF page.

#### Response Example
```text
For:1094N
Page 1 SAN JOSE POLICE DEPT Date Report Run : Tue, May-24-16
FIREARM SEARCH
... (rest of the extracted text)
```
````

--------------------------------

### Using PDFStructTree for Visual Debugging

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to use the `PDFStructTree` class for advanced analysis, including plotting bounding boxes of specific element types.

````APIDOC
## Using PDFStructTree for Visual Debugging

### Description
This example utilizes the `PDFStructTree` class to perform more advanced operations, such as finding all elements of a specific type (e.g., 'TD') and drawing their bounding boxes on a page image.

### Method
```python
import pdfplumber
from pdfplumber.structure import PDFStructTree

# Assuming 'pdffile' is the path to your PDF file
with pdfplumber.open(pdffile) as pdf:
    page = pdf.pages[0] # Get the first page
    stree = PDFStructTree(pdf, page) # Initialize PDFStructTree for the page
    img = page.to_image() # Convert page to an image object

    # Find all 'TD' elements and draw their bounding boxes
    td_elements = [td for td in stree.find_all("TD")]
    img.draw_rects(stree.element_bbox(td) for td in td_elements)

    # To save or display the image:
    # img.save("page_with_td_bboxes.png")
    # img.show()
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Generates an image with bounding boxes drawn)

### Notes
- `PDFStructTree(pdf, page)` initializes the structure tree analysis for a given page.
- `stree.find_all(element_name)` searches for elements by name, regex, or function, similar to BeautifulSoup.
- `stree.element_bbox(element)` returns the bounding box of a given structure element.
````

--------------------------------

### pdfplumber Python: Loading a PDF with options

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Illustrates various ways to load a PDF file using pdfplumber, including specifying file paths, byte streams, and file-like objects. It also shows how to handle password-protected PDFs and configure layout analysis parameters.

```python
import pdfplumber

# Load from file path
with pdfplumber.open("path/to/file.pdf") as pdf:
    pass

# Load from file object (bytes)
with open("path/to/file.pdf", "rb") as f:
    with pdfplumber.open(f) as pdf:
        pass

# Load password-protected PDF
with pdfplumber.open("file.pdf", password = "test") as pdf:
    pass

# Set layout analysis parameters
with pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }) as pdf:
    pass

# Pre-normalize Unicode text
with pdfplumber.open("file.pdf", unicode_norm="NFC") as pdf:
    pass

# Strict metadata parsing
with pdfplumber.open("file.pdf", strict_metadata=True) as pdf:
    pass
```

--------------------------------

### Visual Debugging with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates how to render PDF pages as images and overlay annotations such as rectangles, lines, and table detection results. This is essential for troubleshooting extraction logic and verifying object coordinates.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image(resolution=150)
    im.draw_rects(page.chars, stroke=(255, 0, 0), fill=(255, 0, 0, 50))
    im.draw_lines(page.lines, stroke=(0, 0, 255), stroke_width=2)
    im.debug_tablefinder({"vertical_strategy": "lines", "horizontal_strategy": "lines"})
    im.save("debug_output.png", format="PNG")
    im.show()
```

--------------------------------

### Access and Visualize PDF Pages

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Demonstrates how to retrieve a specific page object from the PDF and convert it into an image for visual inspection.

```python
p0 = pdf.pages[0]
im = p0.to_image()
im
```

--------------------------------

### Load PDF Page with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/ag-energy-roundup-curves.ipynb

Demonstrates loading a specific page from a PDF file using pdfplumber. It opens a PDF and selects the first page for further processing.

```python
report = pdfplumber.open("../pdfs/ag-energy-round-up-2017-02-24.pdf").pages[0]
```

--------------------------------

### Python: Open and Read PDF

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

The Python library provides a context manager to open and interact with PDF files programmatically.

```APIDOC
## pdfplumber.open

### Description
Opens a PDF file for reading and returns a PDF object instance.

### Method
Python Method: pdfplumber.open(path, **kwargs)

### Parameters
#### Arguments
- **path** (string) - Required - Path to the PDF file or file-like object.
- **password** (string) - Optional - Password for protected PDFs.
- **laparams** (dict) - Optional - Layout analysis parameters for pdfminer.six.
- **unicode_norm** (string) - Optional - Unicode normalization form (NFC, NFD, NFKC, NFKD).

### Request Example
import pdfplumber
with pdfplumber.open("file.pdf", password="secret") as pdf:
    page = pdf.pages[0]

### Response
#### Return Value
- **pdf** (object) - An instance of the pdfplumber.PDF class.
```

--------------------------------

### Accessing PDF Logical Structure Tree

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to retrieve and traverse the logical structure tree of a PDF, enabling the identification of semantic elements like headings and paragraphs.

```python
import pdfplumber
import json

with pdfplumber.open("structured_document.pdf") as pdf:
    # Get structure tree for entire document
    structure = pdf.structure_tree
    if structure:
        print(json.dumps(structure, indent=2))
    else:
        print("No structure tree found")

    # Get structure tree for specific page
    page = pdf.pages[0]
    page_structure = page.structure_tree

    # Navigate structure tree
    def print_structure(elements, indent=0):
        for elem in elements:
            print("  " * indent + f"{elem.get('type', 'unknown')}")
            if 'children' in elem:
                print_structure(elem['children'], indent + 1)

    print_structure(structure)
```
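Since the structure tree is a nested list of dictionaries with `type` and `children` keys, collecting every element of one type takes only a short recursive generator. This is a minimal sketch; `find_by_type` and the sample tree are hypothetical and only mirror the shape used in the traversal above.

```python
def find_by_type(elements, wanted):
    """Yield every element dict whose "type" equals `wanted`, depth first."""
    for elem in elements:
        if elem.get("type") == wanted:
            yield elem
        # Elements without children are treated as leaves
        yield from find_by_type(elem.get("children", []), wanted)

# Hypothetical tree shaped like the structure_tree output
sample_tree = [
    {"type": "Document", "children": [
        {"type": "H1", "mcids": [0]},
        {"type": "Sect", "children": [
            {"type": "P", "mcids": [1]},
            {"type": "P", "mcids": [2]},
        ]},
    ]}
]
paragraphs = list(find_by_type(sample_tree, "P"))
print(len(paragraphs))
```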

--------------------------------

### Working with Element Attributes (BBox)

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Illustrates how to access and convert bounding box (BBox) attributes from PDF coordinate space to pdfplumber's coordinate space.

````APIDOC
## Working with Element Attributes (BBox)

### Description
This example shows how to extract and convert the `BBox` attribute from a structure element, such as a `Table`, `Figure`, or `Image`, into pdfplumber's coordinate system.

### Method
```python
# Assuming 'element' is a structure tree element with a 'BBox' attribute
# Assuming 'page' is a pdfplumber Page object

x0, y0, x1, y1 = element['attributes']['BBox']
top = page.height - y1
bottom = page.height - y0
doctop = page.initial_doctop + top
bbox = (x0, top, x1, bottom)

print(f"Original BBox: ({x0}, {y0}, {x1}, {y1})")
print(f"Converted BBox in pdfplumber space: {bbox}")
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Prints to console)

### Notes
- PDF coordinate space has the origin at the bottom-left.
- `page.height` and `page.initial_doctop` are used for conversion.
- The `BBox` attribute provides the bounding box coordinates.
````
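The same conversion can be wrapped in a small function and checked with concrete numbers. This is a sketch under the same assumption as the example above, namely that the page's own origin is at (0, 0); `to_pdfplumber_bbox` is a hypothetical helper name.

```python
def to_pdfplumber_bbox(pdf_bbox, page_height):
    """Convert (x0, y0, x1, y1) with a bottom-left origin to (x0, top, x1, bottom)."""
    x0, y0, x1, y1 = pdf_bbox
    top = page_height - y1      # distance from the top edge to the box's upper side
    bottom = page_height - y0   # distance from the top edge to the box's lower side
    return (x0, top, x1, bottom)

# A box on a US Letter page (792 points tall)
print(to_pdfplumber_bbox((72, 600, 300, 700), page_height=792))
```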

--------------------------------

### Parse and Display Raw Data (Python)

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb

Displays the first two entries of `parsed`, the data already parsed from the PDF report. It assumes `parsed` is a list of dictionaries.

```python
parsed[:2]
```

--------------------------------

### Calculate Character Rotation using CTM

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates how to instantiate a CTM object from a character's matrix property to calculate rotation or skew. This is useful for determining the orientation of text within a PDF.

```python
import pdfplumber
from pdfplumber.ctm import CTM

with pdfplumber.open("path/to/file.pdf") as pdf:
    my_char = pdf.pages[0].chars[3]
    my_char_ctm = CTM(*my_char["matrix"])
    my_char_rotation = my_char_ctm.skew_x
```
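For intuition about how an angle can be read off a six-element transformation matrix `(a, b, c, d, e, f)`, the sketch below applies the standard `atan2` formula for a 2D affine matrix. It is general matrix math, not a reimplementation of `CTM`, and its sign convention may differ from `skew_x`; `rotation_degrees` is a hypothetical name.

```python
import math

def rotation_degrees(matrix):
    """Angle, in degrees, of the transformed x-axis for a matrix (a, b, c, d, e, f)."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))

print(rotation_degrees((1, 0, 0, 1, 0, 0)))   # identity matrix: no rotation
print(rotation_degrees((0, 1, -1, 0, 0, 0)))  # quarter-turn rotation
```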

--------------------------------

### Import Pandas and Create DataFrame

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-ca-warn-report.ipynb

Imports the pandas library and creates a DataFrame from extracted table data. Assumes 'table' is a pre-existing list of lists.

```python
import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])
```
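When pandas is not available, the same list-of-lists `table` can be written out with the standard library. The sketch below assumes that empty cells may be `None` and writes them as empty strings; `table_to_csv` and the sample rows are hypothetical.

```python
import csv
import io

def table_to_csv(table):
    """Serialize a list-of-lists table to CSV text, writing None cells as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()

# Hypothetical table: the first row holds the column names
sample_table = [["Company", "Employees"], ["Acme", "120"], ["Globex", None]]
print(table_to_csv(sample_table))
```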

--------------------------------

### Extract Text from PDF Pages with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Illustrates different methods for extracting text from PDF pages using pdfplumber. Includes basic extraction, layout-preserving extraction, customized extraction with tolerances and density parameters, simple/fast extraction, and extracting individual words with bounding boxes and character details. Also shows how to extract text lines.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Basic text extraction
    text = page.extract_text()
    print(text)

    # Extract text with layout preservation (mimics visual structure)
    text_with_layout = page.extract_text(layout=True)
    print(text_with_layout)

    # Customized text extraction with tolerances
    text = page.extract_text(
        x_tolerance=3,      # Horizontal spacing tolerance
        y_tolerance=3,      # Vertical spacing tolerance
        layout=True,
        x_density=7.25,     # Characters per point (horizontal)
        y_density=13        # Lines per point (vertical)
    )

    # Simple/fast text extraction
    simple_text = page.extract_text_simple(x_tolerance=3, y_tolerance=3)

    # Extract words with bounding boxes
    words = page.extract_words(
        x_tolerance=3,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=False,
        extra_attrs=["fontname", "size"],  # Include font info per word
        split_at_punctuation=True,
        expand_ligatures=True,
        return_chars=True  # Include individual char objects
    )

    for word in words[:5]:
        print(f"Word: '{word['text']}' at ({word['x0']}, {word['top']}) - ({word['x1']}, {word['bottom']})")
        if 'fontname' in word:
            print(f"  Font: {word['fontname']}, Size: {word['size']}")

    # Extract text lines with character details
    text_lines = page.extract_text_lines(
        layout=True,
        strip=True,
        return_chars=True
    )

    for line in text_lines[:3]:
        print(f"Line: '{line['text']}' at top={line['top']}")
```

--------------------------------

### Extract Tables from PDF using pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Demonstrates various methods for extracting tables from PDF pages using pdfplumber. It covers default extraction, extracting the largest table, finding table objects with metadata, and using custom settings for tables without visible lines, explicit lines, or strict line detection. It also shows how to debug the table finder.

```python
import pdfplumber

with pdfplumber.open("document_with_tables.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all tables from page (returns list of 2D arrays)
    tables = page.extract_tables()
    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        for row in table:
            print(row)

    # Extract the largest table
    table = page.extract_table()
    if table:
        headers = table[0]
        for row in table[1:]:
            print(dict(zip(headers, row)))

    # Find table objects (with metadata like cells, bbox)
    table_objects = page.find_tables()
    for tbl in table_objects:
        print(f"Table bbox: {tbl.bbox}")
        print(f"Cells: {len(tbl.cells)}")
        print(f"Rows: {len(tbl.rows)}")

        # Extract table data
        data = tbl.extract()
        print(data)

    # Custom table settings for tables without visible lines
    table_settings = {
        "vertical_strategy": "text",     # Use text alignment
        "horizontal_strategy": "text",
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
        "snap_tolerance": 3,
        "join_tolerance": 3,
        "edge_min_length": 3,
        "intersection_tolerance": 3,
        "text_x_tolerance": 3,
        "text_y_tolerance": 3,
    }
    tables = page.extract_tables(table_settings)

    # Use explicit lines strategy
    table_settings = {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": [50, 150, 300, 450],  # x-coordinates
        "explicit_horizontal_lines": [100, 200, 300, 400],  # y-coordinates
    }
    tables = page.extract_tables(table_settings)

    # Use lines_strict (only actual lines, not rectangle edges)
    table_settings = {
        "vertical_strategy": "lines_strict",
        "horizontal_strategy": "lines_strict",
    }
    tables = page.extract_tables(table_settings)

    # Debug table finder to understand detection
    finder = page.debug_tablefinder(table_settings)
    print(f"Edges found: {len(finder.edges)}")
    print(f"Intersections: {len(finder.intersections)}")
    print(f"Cells: {len(finder.cells)}")
    print(f"Tables: {len(finder.tables)}")
```
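Extracted cells often need light cleanup before further processing. The sketch below assumes cells are either strings or `None` and that line breaks inside a cell should become single spaces; `clean_table` is a hypothetical helper, not part of pdfplumber.

```python
def clean_table(table):
    """Replace None with "" and collapse runs of whitespace (including newlines)."""
    return [[" ".join((cell or "").split()) for cell in row] for row in table]

# Hypothetical raw rows, shaped like the 2D arrays returned above
raw_rows = [["Name", "Total\nChecks"], ["  Alabama ", None]]
print(clean_table(raw_rows))
```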

--------------------------------

### Sort Data by Handgun Checks in Python

Source: https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/extract-table-nics.ipynb

Demonstrates sorting the processed data (`data`) in descending order based on the 'handgun' count and prints the top 6 entries, formatted for readability. This helps identify states with the highest handgun-only background checks.

```python
for row in list(reversed(sorted(data, key=lambda x: x["handgun"])))[:6]:
    print("{state}: {handgun:,d} handgun-only checks".format(**row))
```

--------------------------------

### Crop and Filter PDF Pages with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Illustrates how to manipulate PDF pages using pdfplumber. This includes cropping pages to specific regions using bounding boxes, filtering objects based on custom criteria (like size or color), removing duplicate characters, and chaining these operations for complex page processing.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Crop to specific bounding box (x0, top, x1, bottom)
    cropped = page.crop((0, 0, page.width / 2, page.height / 2))
    text = cropped.extract_text()
    print(f"Text from top-left quarter: {text[:100]}")

    # Crop with relative coordinates (offset from page origin)
    cropped = page.crop((50, 100, 300, 400), relative=True)

    # Get only objects fully within a bounding box
    within = page.within_bbox((100, 100, 400, 400))
    print(f"Objects fully within box: {len(within.chars)} chars")

    # Get objects outside a bounding box
    outside = page.outside_bbox((100, 100, 400, 400))
    print(f"Objects outside box: {len(outside.chars)} chars")

    # Filter objects with custom function
    def is_large_text(obj):
        return obj.get("size", 0) > 12

    filtered = page.filter(is_large_text)
    large_text = filtered.extract_text()
    print(f"Large text only: {large_text}")

    # Filter by color
    def is_red_text(obj):
        color = obj.get("non_stroking_color", (0, 0, 0))
        if isinstance(color, tuple) and len(color) >= 3:
            return color[0] > 0.5 and color[1] < 0.3 and color[2] < 0.3
        return False

    red_text_page = page.filter(is_red_text)
    print(f"Red text: {red_text_page.extract_text()}")

    # Remove duplicate characters
    deduped = page.dedupe_chars(tolerance=1, extra_attrs=("fontname", "size"))
    print(f"Original chars: {len(page.chars)}, Deduped: {len(deduped.chars)}")

    # Chain operations
    result = (
        page
        .crop((50, 50, 500, 700))
        .filter(lambda obj: obj.get("size", 0) > 10)
        .dedupe_chars()
    )
    text = result.extract_text()
```
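A crop box that extends past the page edges may be rejected, so it can help to clamp it to the page's own bounding box first (for example `page.bbox`). The helper below is a minimal sketch; `clamp_bbox` is a hypothetical name, and both boxes are assumed to use the `(x0, top, x1, bottom)` order shown above.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so that it lies inside page_bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))

# A box that overflows a 612 x 792 page on three sides
print(clamp_bbox((-10, 50, 700, 900), (0, 0, 612, 792)))
```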

--------------------------------

### pdfplumber.open()

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Opens a PDF file or stream and returns a PDF object for accessing pages and metadata.

```APIDOC
## pdfplumber.open

### Description
Opens a PDF document from a file path or stream. Supports password protection, layout analysis parameters, and Unicode normalization.

### Method
Python Method: pdfplumber.open(path, **kwargs)

### Parameters
#### Positional Arguments
- **path** (string/file-like) - Required - The file path, file object, or BytesIO stream of the PDF.

#### Keyword Arguments
- **password** (string) - Optional - Password for protected PDFs.
- **laparams** (dict) - Optional - Layout analysis parameters for text object detection.
- **unicode_norm** (string) - Optional - Unicode normalization form (e.g., 'NFC').
- **pages** (list) - Optional - List of specific page numbers to load (1-indexed).

### Request Example
pdfplumber.open("document.pdf", password="secret123", laparams={"line_overlap": 0.7})

### Response
#### Return Value
- **pdf** (object) - A PDF object containing metadata and a list of page objects.
```

--------------------------------

### Accessing Page Structure Tree Elements

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to iterate through the structure tree of a specific PDF page to retrieve element types and their associated marked content IDs.

```python
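import pdfplumber

# 'pdffile' is a placeholder for the path to a tagged PDF
with pdfplumber.open(pdffile) as pdf:
    for element in pdf.pages[0].structure_tree:
        # Elements are plain dicts; fields with no value may be omitted
        print(element["type"], element.get("mcids"))
        for child in element.get("children", []):
            print(child["type"], child.get("mcids"))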
```

--------------------------------

### Repairing Malformed PDFs

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to use the repair utility to fix corrupted PDF files using Ghostscript. It supports direct file output, stream processing, and integration with pdfplumber's open method.

```python
from pdfplumber import repair
import pdfplumber

# Repair and save to file
repair("malformed.pdf", outfile="repaired.pdf")

# Open with automatic repair
with pdfplumber.open("malformed.pdf", repair=True) as pdf:
    text = pdf.pages[0].extract_text()
```

--------------------------------

### Apply Explicit Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates the 'explicit' strategy for pdfplumber table extraction, which relies solely on lines defined in 'explicit_vertical_lines' and 'explicit_horizontal_lines'. This offers the highest level of control over table structure definition.

```python
table_settings = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
}
```
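A settings dictionary for the explicit strategy can be assembled from known column and row positions. The sketch below uses only the four setting keys named in this document; `explicit_table_settings` and the coordinates are hypothetical.

```python
def explicit_table_settings(vertical_lines, horizontal_lines):
    """Build table settings that rely only on the given line positions."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": sorted(vertical_lines),      # x-coordinates
        "explicit_horizontal_lines": sorted(horizontal_lines),  # y-coordinates
    }

settings = explicit_table_settings([300, 50, 150], [100, 200])
print(settings["explicit_vertical_lines"])
```

The resulting dictionary could then be passed to `page.extract_tables(settings)`.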

--------------------------------

### Accessing Structure Tree for a Page

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Demonstrates how to iterate through the structure tree of a specific page and access element types and marked content IDs (MCIDs).

````APIDOC
## Accessing Structure Tree for a Page

### Description
This code snippet shows how to access the structure tree for a specific page in a PDF document and iterate through its elements, printing their types and associated MCIDs.

### Method
```python
import pdfplumber

# 'pdffile' is a placeholder for the path to a tagged PDF
with pdfplumber.open(pdffile) as pdf:
    for element in pdf.pages[0].structure_tree:
        # Elements are plain dicts; fields with no value may be omitted
        print(element["type"], element.get("mcids"))
        for child in element.get("children", []):
            print(child["type"], child.get("mcids"))
```

### Endpoint
N/A (This is a library usage example)

### Parameters
N/A

### Request Body
N/A

### Response
N/A (Prints to console)

### Notes
- The `type` field indicates the structure element type (e.g., 'P', 'H1', 'Table').
- The `mcids` field is a list of marked content section IDs related to the element.
- Additional fields like `lang`, `alt_text`, and `attributes` may be present.
````

--------------------------------

### pdfplumber Python: Basic PDF page access

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates the basic usage of the pdfplumber Python library: import it, open a PDF file, and print the first character object on the first page.

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
```

--------------------------------

### Visualizing Element Bounding Boxes with PDFStructTree

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Uses the PDFStructTree object to find specific elements like table cells (TD) and draw their bounding boxes on a page image.

```python
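from pdfplumber.structure import PDFStructTree

# Assumes `pdf` is an open pdfplumber PDF whose first page has a tagged table
page = pdf.pages[0]
stree = PDFStructTree(pdf, page)
table = stree.find("Table")  # assumption: the first Table element is the one wanted
img = page.to_image()
img.draw_rects(stree.element_bbox(td) for td in table.find_all("TD"))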
```

--------------------------------

### Configure pdfplumber Table Extraction Settings

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

This snippet shows the default settings dictionary for pdfplumber's extract_tables method. These settings control how tables are identified and extracted, including strategies for vertical and horizontal separation, tolerance levels for snapping and joining lines, and text extraction parameters.

```python
{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "edge_min_length_prefilter": 1,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": "…"
}
```

--------------------------------

### Table Extraction Settings API

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

This section details the `table_settings` argument for the `extract_tables` method in pdfplumber, outlining all available configuration options and their default values.

```APIDOC
## Table-extraction settings

By default, `extract_tables` uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the `table_settings` argument. The possible settings, and their defaults:

```python
{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "edge_min_length_prefilter": 1,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": …,
}
```

| Setting | Description |
|---------|-------------|
|`"vertical_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"horizontal_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"explicit_vertical_lines"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `x` coordinate of a line the full height of the page — or `line`/`rect`/`curve` objects.|
|`"explicit_horizontal_lines"`| A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (each indicating the `y` coordinate of a line spanning the full width of the page) or `line`/`rect`/`curve` objects.|
|`"snap_tolerance"`, `"snap_x_tolerance"`, `"snap_y_tolerance"`| Parallel lines within `snap_tolerance` points will be "snapped" to the same horizontal or vertical position.|
|`"join_tolerance"`, `"join_x_tolerance"`, `"join_y_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
|`"edge_min_length_prefilter"`| Edges shorter than `edge_min_length_prefilter` will be discarded during initial edge filtering from the page. Lowering this value (e.g., to `0.5`) can help capture small dashed lines that might otherwise be filtered out.|
|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
|`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` points to be considered intersecting.|
|`"text_*"`| All settings prefixed with `text_` are used when extracting text from each discovered table. All possible arguments to `Page.extract_text(...)` are also valid here.|
|`"text_x_tolerance"`, `"text_y_tolerance"`| These `text_`-prefixed settings *also* apply to the table-identification algorithm when the `text` strategy is used. I.e., when that algorithm searches for words, it will expect the individual letters in each word to be no more than `text_x_tolerance`/`text_y_tolerance` points apart.|


```
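
To make the tolerance settings concrete, here is a simplified sketch of what "snapping" means: nearby parallel line positions are grouped and moved to a shared coordinate. It is an illustration written for this note, not pdfplumber's own implementation; the function name `snap_positions`, the use of the group average, and the sample coordinates are all assumptions.

```python
def snap_positions(positions, tolerance=3):
    """Group positions whose neighbours lie within `tolerance` of each other
    and replace every member of a group with the group average.

    Simplified illustration only; results come back in sorted order.
    """
    snapped = []
    cluster = []
    for pos in sorted(positions):
        # Start a new cluster when the gap to the previous value is too large
        if cluster and pos - cluster[-1] > tolerance:
            avg = sum(cluster) / len(cluster)
            snapped.extend([avg] * len(cluster))
            cluster = []
        cluster.append(pos)
    if cluster:
        avg = sum(cluster) / len(cluster)
        snapped.extend([avg] * len(cluster))
    return snapped


# x coordinates of five nearly aligned vertical lines (made-up values)
print(snap_positions([100.0, 101.0, 102.0, 250.0, 251.0]))
# prints [101.0, 101.0, 101.0, 250.5, 250.5]
```

A larger tolerance merges more lines into one position, which helps with slightly misdrawn rules but can also fuse columns that are genuinely separate.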

--------------------------------

### Configure Text-Based Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Shows how to configure the 'text' strategy for vertical and horizontal table extraction in pdfplumber. This strategy deduces cell boundaries based on word alignment, requiring a minimum number of words ('min_words_vertical', 'min_words_horizontal') to share an alignment.

```python
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 5,
    "min_words_horizontal": 2,
}
```
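
The idea behind the minimum-word thresholds can be sketched in a few lines. The example below is a simplified illustration, not pdfplumber's algorithm: it looks only at left edges (`x0`), whereas the real strategy is described as also considering right edges and centers. The function name `aligned_left_edges` and the word data are made up.

```python
def aligned_left_edges(words, min_words=3, tolerance=3):
    """Return x positions where at least `min_words` words share a left edge.

    `words` is a list of dicts with an "x0" key, as in extract_words() output.
    Simplified illustration of the 'text' strategy idea.
    """
    edges = []
    cluster = []
    for x0 in sorted(word["x0"] for word in words):
        # A gap wider than the tolerance closes the current group
        if cluster and x0 - cluster[-1] > tolerance:
            if len(cluster) >= min_words:
                edges.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x0)
    if len(cluster) >= min_words:
        edges.append(sum(cluster) / len(cluster))
    return edges


# Made-up words: three rows of a two-column layout plus one stray word
words = [
    {"text": "Name", "x0": 50.0}, {"text": "Qty", "x0": 200.0},
    {"text": "Apple", "x0": 50.0}, {"text": "3", "x0": 200.0},
    {"text": "Pear", "x0": 50.0}, {"text": "7", "x0": 200.0},
    {"text": "Note", "x0": 120.0},
]
print(aligned_left_edges(words, min_words=3))  # prints [50.0, 200.0]
```

Raising `min_words` to 4 leaves no qualifying edge in this sample, which mirrors how a stricter `min_words_vertical` suppresses spurious column boundaries.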

--------------------------------

### Converting PDF BBox Coordinates to pdfplumber Space

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/structure.md

Shows how to transform a BBox attribute from the PDF coordinate system (origin at bottom-left) to the pdfplumber coordinate system.

```python
x0, y0, x1, y1 = element['attributes']['BBox']
top = page.height - y1
bottom = page.height - y0
doctop = page.initial_doctop + top
bbox = (x0, top, x1, bottom)
```
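
The same arithmetic can be wrapped in a small function so it is easy to check with numbers. This is a sketch that assumes an unrotated page whose coordinate origin sits at the bottom-left corner; the name `pdf_bbox_to_plumber` and the sample values are illustrative only.

```python
def pdf_bbox_to_plumber(bbox, page_height, initial_doctop=0):
    """Convert a PDF-space BBox (origin bottom-left) to pdfplumber space.

    Returns ((x0, top, x1, bottom), doctop).
    """
    x0, y0, x1, y1 = bbox
    top = page_height - y1      # upper edge, measured from the page top
    bottom = page_height - y0   # lower edge, measured from the page top
    doctop = initial_doctop + top
    return (x0, top, x1, bottom), doctop


# Made-up numbers: a 792 pt tall page that is the second page of a document
bbox, doctop = pdf_bbox_to_plumber(
    (72, 700, 200, 720), page_height=792, initial_doctop=792
)
print(bbox, doctop)  # prints (72, 72, 200, 92) 864
```

Here the box's upper edge at `y1 = 720` lands 72 points below the top of the page (`792 - 720`), and adding the preceding page's height gives a `doctop` of 864.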

--------------------------------

### Notes on Table Extraction

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Additional notes and considerations for using pdfplumber's table extraction feature, including version changes.

```APIDOC
## Notes

- Often it's helpful to crop a page — `Page.crop(bounding_box)` — before trying to extract the table. 

- Table extraction for `pdfplumber` was radically redesigned for `v0.5.0`, and introduced breaking changes.

```

--------------------------------

### Crop Page Before Table Extraction

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Provides a note on a common practice in pdfplumber: cropping a page using Page.crop(bounding_box) before attempting table extraction. This can help isolate specific table areas and improve extraction accuracy.

```python
# Assumes `page` is a pdfplumber Page and the four coordinates are defined
bounding_box = (x0, top, x1, bottom)
cropped_page = page.crop(bounding_box)
tables = cropped_page.extract_tables()
```
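
`Page.crop` is understood to reject, by default, a bounding box that extends past the page. One way to guard against that is to clamp the box to the page first. The sketch below is a hypothetical helper (`clamp_bbox` is not part of pdfplumber) that works on plain `(x0, top, x1, bottom)` tuples; the coordinates are made up.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink `bbox` so it lies inside `page_bbox`.

    Both are (x0, top, x1, bottom) tuples in pdfplumber coordinates.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    # A box that does not overlap the page at all leaves nothing to crop
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError("bounding box does not overlap the page")
    return clamped


# Made-up values: a region that spills past the right edge of a 612 x 792 page
print(clamp_bbox((400, 100, 700, 300), (0, 0, 612, 792)))
# prints (400, 100, 612, 300)
```

With a real page this would be used as `page.crop(clamp_bbox(region, page.bbox))`.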

--------------------------------

### Access PDF Objects and Attributes with pdfplumber

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Shows how to retrieve and inspect various PDF objects such as characters, lines, rectangles, curves, images, and annotations. Each object provides detailed attributes like coordinates, styling, and color information.

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Access all objects dictionary
    print(f"Object types found: {page.objects.keys()}")

    # Access character objects
    for char in page.chars[:5]:
        print(f"Char: '{char['text']}' at ({char['x0']}, {char['top']})")
        print(f"  Font: {char['fontname']}, Size: {char['size']}")
        print(f"  Color: {char['non_stroking_color']}")

    # Access line objects
    for line in page.lines[:3]:
        print(f"Line from ({line['x0']}, {line['top']}) to ({line['x1']}, {line['bottom']})")
        print(f"  Width: {line['linewidth']}, Color: {line['stroking_color']}")

    # Access rectangle objects
    for rect in page.rects[:3]:
        print(f"Rect: ({rect['x0']}, {rect['top']}, {rect['x1']}, {rect['bottom']})")
        print(f"  Fill: {rect['non_stroking_color']}, Stroke: {rect['stroking_color']}")

    # Access curve objects
    for curve in page.curves[:2]:
        print(f"Curve with {len(curve['pts'])} points")
        print(f"  Fill: {curve['fill']}, Dash: {curve['dash']}")

    # Access image objects
    for img in page.images:
        print(f"Image at ({img['x0']}, {img['top']}), size: {img['srcsize']}")
        print(f"  Colorspace: {img['colorspace']}, Bits: {img['bits']}")

    # Access annotations and hyperlinks
    for annot in page.annots:
        print(f"Annotation: {annot['contents']}")

    for link in page.hyperlinks:
        print(f"Link: {link['uri']} at ({link['x0']}, {link['top']})")

    # Access edge objects (derived from lines, rects, curves)
    print(f"Total edges: {len(page.edges)}")
    print(f"Horizontal edges: {len(page.horizontal_edges)}")
    print(f"Vertical edges: {len(page.vertical_edges)}")
```
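
Since every object is a plain dict, ordinary Python tools are enough to summarize them. As one possible follow-up, the sketch below counts characters per font and size, which can help spot headings. The helper name `font_usage`, the rounding to one decimal place, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def font_usage(chars):
    """Count characters per (fontname, rounded size) pair.

    `chars` is a list of dicts shaped like `page.chars`.
    """
    # Sizes are rounded because extracted values can carry float noise
    return Counter((char["fontname"], round(char["size"], 1)) for char in chars)


# Hand-written sample standing in for page.chars
sample_chars = [
    {"text": "T", "fontname": "Helvetica-Bold", "size": 14.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 14.0},
    {"text": "a", "fontname": "Helvetica", "size": 10.0},
]
for (fontname, size), count in font_usage(sample_chars).most_common():
    print(fontname, size, count)
```

Calling `font_usage(page.chars)` on a real page should give the same kind of tally.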

--------------------------------

### Visual Debugging API

Source: https://context7.com/jsvine/pdfplumber/llms.txt

Methods for rendering PDF pages as images and drawing annotations to visualize extracted data and structure.

```APIDOC
## Page.to_image()

### Description
Converts a PDF page into a `PageImage` object for visual debugging, such as drawing rectangles around characters, words, or lines, and visualizing table detection. This is a Python method call, not an HTTP endpoint.

### Method
`Page.to_image(...)`

### Parameters
#### Keyword Arguments
- **resolution** (int) - Optional - DPI for the rendered image.
- **width** (int) - Optional - Width in pixels.
- **height** (int) - Optional - Height in pixels.
- **antialias** (bool) - Optional - Enable smoother rendering.

### Response
#### Return Value
- **PageImage** (object) - An image object with methods like `draw_rects`, `draw_lines`, and `save`.
```

--------------------------------

### Create a PageImage object

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Converts a PDF page into a PageImage object. Supports optional parameters like resolution and antialiasing to control output quality.

```python
im = my_pdf.pages[0].to_image(resolution=150)
```
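
A `PageImage` is most useful once something is drawn on it and the result is saved. The sketch below wraps that sequence in a hypothetical helper, `save_table_debug_image`, which is not part of pdfplumber. It relies only on `to_image`, `debug_tablefinder`, and `save`; the default resolution of 150 is an arbitrary choice.

```python
def save_table_debug_image(page, out_path, resolution=150, table_settings=None):
    """Render `page`, overlay the table finder's view, and save the image.

    `page` is expected to be a pdfplumber Page. Nothing is imported here,
    so the function depends only on the methods it calls.
    """
    im = page.to_image(resolution=resolution)
    # debug_tablefinder overlays what the table finder detected on the page
    im.debug_tablefinder(table_settings or {})
    im.save(out_path)
    return out_path
```

With a PDF opened via `pdfplumber.open(...)`, calling `save_table_debug_image(pdf.pages[0], "page1-debug.png")` should write an annotated PNG that shows which lines and cells the current settings pick up.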

--------------------------------

### Repair PDF on the fly with pdfplumber

Source: https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This method repairs a malformed PDF file when opening it with pdfplumber. The repair is done in memory and the repaired version is not saved to disk. It takes the file path as input and accepts a boolean `repair` argument.

```python
import pdfplumber

with pdfplumber.open("malformed.pdf", repair=True) as pdf:
    # Process the PDF
    print(pdf.pages[0].extract_text())
```

--------------------------------

### Table Extraction Strategies

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Explanation of the different strategies available for `vertical_strategy` and `horizontal_strategy` in pdfplumber's table extraction.

```APIDOC
## Table-extraction strategies

Both `vertical_strategy` and `horizontal_strategy` accept the following options:

| Strategy | Description |
|----------|-------------|
| `"lines"` | Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells. |
| `"lines_strict"` | Use the page's graphical lines — but *not* the sides of rectangle objects — as the borders of potential table-cells. |
| `"text"` | For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words. |
| `"explicit"` | Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`. |


```

--------------------------------

### Use Strict Line-Based Table Extraction Strategy

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Explains the 'lines_strict' strategy for pdfplumber table extraction, which uses only graphical lines as cell borders, excluding the edges of rectangle objects. This provides a more constrained approach to identifying table structures.

```python
table_settings = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}
```

--------------------------------

### Extract Table from PDF Page

Source: https://github.com/jsvine/pdfplumber/blob/stable/README.md

Demonstrates how to open a PDF file and extract the largest table from the first page using pdfplumber's extract_table method.

```python
import pdfplumber

with pdfplumber.open("path/to/my.pdf") as pdf:
    page = pdf.pages[0]
    table_data = page.extract_table()
```
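
The extracted table is a list of rows, each a list of cell strings (or `None` for empty cells), and the call is understood to return `None` when no table is found. A common next step is to turn the rows into records keyed by the header. The sketch below assumes the first row is the header; `table_to_records` is a hypothetical helper and the sample data is hand-written.

```python
def table_to_records(table):
    """Turn an extracted table (list of rows) into a list of dicts.

    The first row is treated as the header; `None` cells become "".
    """
    if not table:
        return []  # covers both None (no table found) and an empty list
    header = [cell or "" for cell in table[0]]
    return [
        dict(zip(header, (cell or "" for cell in row)))
        for row in table[1:]
    ]


# Hand-written sample in the shape extract_table() returns
sample_table = [["Item", "Qty"], ["Apple", "3"], ["Pear", None]]
print(table_to_records(sample_table))
# prints [{'Item': 'Apple', 'Qty': '3'}, {'Item': 'Pear', 'Qty': ''}]
```

In the snippet above this would be applied as `table_to_records(table_data)`.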