### Install python-poppler from Git Source

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/installation.md

Clone the python-poppler repository, then install it with pip from inside the checkout. The install step compiles the C++ bindings and installs the package. Ensure all prerequisites, including the poppler library, are installed before running these commands.

```bash
git clone https://github.com/cbrunet/python-poppler.git
cd python-poppler
pip install --use-pep517 .
```

--------------------------------

### Install python-poppler from PyPI

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/installation.md

This command installs the python-poppler package using pip. Ensure all system requirements, including the correct poppler library version, are met beforehand. Installing inside a Python virtual environment is recommended.

```bash
pip install --use-pep517 python-poppler
```

--------------------------------

### Verify Poppler Version in Python

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/installation.md

A short Python snippet that imports the poppler module and prints its version. Use it to verify that the installation succeeded and that the python-poppler bindings are using the expected version of the Poppler library.

```python
import poppler

print(poppler.version())
```

--------------------------------

### Compile Poppler from Source and Set Environment Variables

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/installation.md

Instructions for compiling a custom version of the Poppler library from source and setting environment variables so that python-poppler can find the compiled library. This is useful when you need a more recent version of Poppler than the one available in system repositories. The steps cover build configuration with CMake and setting `PKG_CONFIG_PATH` and `LD_LIBRARY_PATH`.
```bash
git clone https://gitlab.freedesktop.org/poppler/poppler.git
cd poppler
git checkout poppler-0.89.0
mkdir build
cd build
cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX:PATH=/usr/local \
    -DENABLE_UNSTABLE_API_ABI_HEADERS=ON \
    -DBUILD_GTK_TESTS=OFF \
    -DBUILD_QT5_TESTS=OFF \
    -DBUILD_CPP_TESTS=OFF \
    -DENABLE_CPP=ON \
    -DENABLE_GLIB=OFF \
    -DENABLE_GOBJECT_INTROSPECTION=OFF \
    -DENABLE_GTK_DOC=OFF \
    -DENABLE_QT5=OFF \
    -DBUILD_SHARED_LIBS=ON \
    ..
sudo make install
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
```

--------------------------------

### Load and Render PDF Page in Python

Source: https://github.com/cbrunet/python-poppler/blob/master/README.md

Demonstrates loading a PDF document, accessing a specific page, extracting its text, and rendering the page to an image using python-poppler. Requires the Poppler library to be installed.

```python
from poppler import load_from_file, PageRenderer

pdf_document = load_from_file("sample.pdf")
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

renderer = PageRenderer()
image = renderer.render_page(page_1)
image_data = image.data
```

--------------------------------

### Get Font Information from PDF Page (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Demonstrates two ways to retrieve font information. The first snippet lists the fonts used on each page with a font iterator. The second reads the font name and size of a text box on a specific page, which requires passing the `text_list_include_font` option to the `text_list` method.
```python
# Assuming 'document' is a loaded Document object
font_iterator = document.create_font_iterator()
for page, fonts in font_iterator:
    print(f"Fonts for page {page}")
    for font in fonts:
        print(f"- {font.name}")
```

```python
# Assuming 'pdf_page' is a Page object
boxes = pdf_page.text_list(pdf_page.TextListOption.text_list_include_font)
box = boxes[0]
assert box.has_font_info
print(box.get_font_name())
print(box.get_font_size())
```

--------------------------------

### Get Named Destinations Map in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Retrieves a map of named destinations from a PDF document. The snippet iterates through each destination and prints its name, type, page number, coordinates, zoom level, and change status flags. This is useful for navigating a PDF document or inspecting its structure.

```python
# Assuming 'pdf_document' is a loaded Document object
destinations = pdf_document.create_destination_map()

for name, destination in destinations.items():
    print(f"\nDestination: {name}")
    print(f"Type: {destination.type}")
    print(f"Page number: {destination.page_number}")

    # Destination coordinates and zoom
    print(f"Left: {destination.left}")
    print(f"Top: {destination.top}")
    print(f"Right: {destination.right}")
    print(f"Bottom: {destination.bottom}")
    print(f"Zoom: {destination.zoom}")

    # Check if destination values are set
    print(f"Is change left: {destination.is_change_left}")
    print(f"Is change top: {destination.is_change_top}")
    print(f"Is change zoom: {destination.is_change_zoom}")
```

--------------------------------

### Convert PDF Image to QImage (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Demonstrates converting a rendered PDF page image into a Qt `QImage` object, which makes it easier to integrate with Qt applications. The conversion requires a mapping between Poppler's `ImageFormat` and Qt's `QtGui.QImage.Format`.
```python
# Assuming 'image' is an Image object obtained from rendering a PDF page
# and 'QtGui' is imported from PyQt5 or PySide2
from poppler import ImageFormat

P2QFormat = {
    ImageFormat.invalid: QtGui.QImage.Format_Invalid,
    ImageFormat.argb32: QtGui.QImage.Format_ARGB32,
    ImageFormat.bgr24: QtGui.QImage.Format_BGR888,
    ImageFormat.gray8: QtGui.QImage.Format_Grayscale8,
    ImageFormat.mono: QtGui.QImage.Format_Mono,
    ImageFormat.rgb24: QtGui.QImage.Format_RGB888,
}

qimg = QtGui.QImage(
    image.data,
    image.width,
    image.height,
    image.bytes_per_row,
    P2QFormat[image.format],
)
```

--------------------------------

### Render PDF Page to Image (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Illustrates converting a PDF page into an image. It involves creating a `PageRenderer` object and then using its `render_page` method to obtain an `Image` object.

```python
from poppler import PageRenderer

# Assuming 'document' is a loaded Document object
page_number = 0  # Example page number (0-indexed)
pdf_page = document.create_page(page_number)

# Create a PageRenderer object
renderer = PageRenderer()

# Render the page to an Image object
image = renderer.render_page(pdf_page)
```

--------------------------------

### Load PDF Documents in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Demonstrates loading PDF documents from files, byte data, or file-like objects using python-poppler. It also shows how to handle password-protected PDFs and access basic document properties such as page count and encryption status. Dependencies include the poppler library.
```python
from pathlib import Path

from poppler import load_from_file, load_from_data, load

# Load from file path (string or Path object)
pdf_document = load_from_file("document.pdf")

# Load password-protected document
pdf_document = load_from_file("secure.pdf", owner_password="owner", user_password="user")

# Load from bytes
with open("document.pdf", "rb") as f:
    file_data = f.read()
pdf_document = load_from_data(file_data)

# Load using generic function (accepts str, Path, bytes, or file-like objects)
pdf_document = load("document.pdf", owner_password="owner")
pdf_document = load(Path("document.pdf"))
with open("document.pdf", "rb") as f:
    pdf_document = load(f)

# Check document properties
print(f"Pages: {pdf_document.pages}")
print(f"Encrypted: {pdf_document.is_encrypted()}")
print(f"Locked: {pdf_document.is_locked()}")
print(f"PDF Version: {pdf_document.pdf_version}")  # Returns tuple like (1, 5)

# Unlock a locked document
if pdf_document.is_locked():
    # The return value is treated as the new lock status, so False means success
    still_locked = pdf_document.unlock("owner_pass", "user_pass")
    print(f"Successfully unlocked: {not still_locked}")
```

--------------------------------

### Manage PDF Document Destinations and Links

Source: https://context7.com/cbrunet/python-poppler/llms.txt

This Python snippet loads a PDF document as the starting point for working with its named destinations and document links. It requires the `poppler` library version 0.74.0 or later. The snippet only initializes the document; the extraction logic for destinations and links is not included.

```python
from poppler import load_from_file, DestinationType

pdf_document = load_from_file("document.pdf")
```

--------------------------------

### Convert PDF Image to PIL Image (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Shows how to convert a PDF page image into a PIL (Pillow) `Image` object.
This is useful for further image manipulation with the Pillow library. Note that a copy of the image data is unavoidable in this conversion.

```python
from PIL import Image, ImageTk

# Assuming 'image' is an Image object obtained from rendering a PDF page
pil_image = Image.frombytes(
    "RGBA",
    (image.width, image.height),
    image.data,
    "raw",
    str(image.format),
)

# tk_image = ImageTk.PhotoImage(pil_image)  # Example for Tkinter
```

--------------------------------

### Create and Use Rectangles in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Illustrates the creation and use of Rectangle objects for defining regions or bounding boxes within a PDF page. Rectangles are created from coordinates and dimensions, and their properties can be read back. They are also used to extract text from specific areas of a page.

```python
from poppler import Rectangle

# Create rectangle (x, y, width, height)
rect = Rectangle(100.0, 150.0, 200.0, 300.0)

# Access coordinates
print(f"X: {rect.x}")
print(f"Y: {rect.y}")
print(f"Width: {rect.width}")
print(f"Height: {rect.height}")

# Get as tuple
coords = rect.as_tuple()  # (x, y, width, height)

# Create empty rectangle
empty_rect = Rectangle(0.0, 0.0, 0.0, 0.0)

# Rectangles are used for text extraction regions
# (assuming 'pdf_document' is a loaded Document object)
page = pdf_document.create_page(0)
region_text = page.text(rect=rect)
```

--------------------------------

### Render PDF Page to Image

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Renders a PDF page to raw image data. Supports customizable resolution, rendering hints (such as antialiasing), paper color, image format, rotation, and rendering of a specific region. The output can be accessed as raw bytes or converted to NumPy arrays or PIL Images.
```python
from poppler import load_from_file, PageRenderer, RenderHint, ImageFormat, Rotation
import numpy as np
from PIL import Image as PILImage

pdf_document = load_from_file("document.pdf")
page = pdf_document.create_page(0)

# Create renderer with default settings
renderer = PageRenderer()

# Check if rendering is supported
if not PageRenderer.can_render():
    raise RuntimeError("Poppler compiled without rendering support")

# Configure rendering options, one hint at a time...
renderer.set_render_hint(RenderHint.antialiasing, True)
renderer.set_render_hint(RenderHint.text_antialiasing, True)
# ...or all at once
renderer.render_hints = RenderHint.antialiasing | RenderHint.text_antialiasing

# Set paper color (default is white)
renderer.paper_color = (255, 255, 255)

# Set image format (requires poppler >= 0.65.0)
renderer.image_format = ImageFormat.argb32

# Render page at 150 DPI
image = renderer.render_page(page, xres=150.0, yres=150.0)

# Render with rotation
image = renderer.render_page(page, xres=72.0, yres=72.0, rotate=Rotation.rotate_90)

# Render specific region (x, y, width, height in pixels)
image = renderer.render_page(page, xres=72.0, yres=72.0, x=0, y=0, w=400, h=600)

# Access image data
print(f"Image size: {image.width}x{image.height}")
print(f"Format: {image.format}")
print(f"Bytes per row: {image.bytes_per_row}")
print(f"Valid: {image.is_valid}")

# Get raw image bytes
image_bytes = image.data

# Save image to file; the second argument is the output file format name
# (assumed here to be a string such as "png", as in poppler-cpp's image::save)
image.save("output.png", "png", dpi=150)

# Convert to numpy array (zero-copy)
array = np.array(image.memoryview(), copy=False)
print(f"Array shape: {array.shape}")

# Convert to PIL Image
pil_image = PILImage.frombytes(
    "RGBA",
    (image.width, image.height),
    image.data,
    "raw",
    str(image.format),
)
pil_image.save("output_pil.png")
```

--------------------------------

### Access PDF Page Properties and Layout

Source: https://context7.com/cbrunet/python-poppler/llms.txt

This Python code retrieves and prints various properties of each page within a PDF document, including its label,
orientation, duration, and dimensions for different page boxes (media, crop, bleed, trim, art). It also reads the page transition effect if one is present. Finally, it prints the document's overall page layout and page mode. Dependencies include the `poppler` library and its `PageBox`, `PageLayout`, and `PageMode` enums.

```python
from poppler import load_from_file, PageBox, PageLayout, PageMode, Rotation

pdf_document = load_from_file("document.pdf")

for page_index in range(pdf_document.pages):
    page = pdf_document.create_page(page_index)

    print(f"\n--- Page {page_index} ---")
    print(f"Label: {page.label}")
    print(f"Orientation: {page.orientation}")
    print(f"Duration: {page.duration}")

    # Page boxes
    media_box = page.page_rect(PageBox.media_box)
    crop_box = page.page_rect(PageBox.crop_box)
    bleed_box = page.page_rect(PageBox.bleed_box)
    trim_box = page.page_rect(PageBox.trim_box)
    art_box = page.page_rect(PageBox.art_box)
    print(f"Media box: {media_box.as_tuple()}")
    print(f"Crop box: {crop_box.as_tuple()}")

    # Page transition effect, if any
    transition = page.transition()
    if transition:
        print(f"Transition type: {transition.type}")
        print(f"Transition duration: {transition.duration}")
        print(f"Alignment: {transition.alignment}")
        print(f"Direction: {transition.direction}")
        print(f"Angle: {transition.angle}")
        print(f"Scale: {transition.scale}")
        print(f"Rectangular: {transition.is_rectangular}")

# Document-level layout and mode
print(f"Page layout: {pdf_document.page_layout}")
print(f"Page mode: {pdf_document.page_mode}")
```

--------------------------------

### Manage PDF Document Permissions

Source: https://context7.com/cbrunet/python-poppler/llms.txt

This Python script checks various permissions of a PDF document, such as the ability to print, modify, copy text, add annotations, fill forms, extract content for accessibility, assemble the document, and perform high-resolution printing. It requires the `poppler` library and optionally the owner password for protected documents.
The output shows which permissions are granted and which are denied.

```python
from poppler import load_from_file, Permission

pdf_document = load_from_file("document.pdf", owner_password="owner")

can_print = pdf_document.has_permission(Permission.print)
can_modify = pdf_document.has_permission(Permission.change)
can_copy = pdf_document.has_permission(Permission.copy)
can_annotate = pdf_document.has_permission(Permission.add_notes)
can_fill_forms = pdf_document.has_permission(Permission.fill_forms)
can_extract = pdf_document.has_permission(Permission.accessibility)
can_assemble = pdf_document.has_permission(Permission.assemble)
can_print_hires = pdf_document.has_permission(Permission.print_high_resolution)

print(f"Print: {can_print}")
print(f"Modify: {can_modify}")
print(f"Copy text: {can_copy}")
print(f"Add annotations: {can_annotate}")
print(f"Fill forms: {can_fill_forms}")
print(f"Extract for accessibility: {can_extract}")
print(f"Assemble document: {can_assemble}")
print(f"High-resolution print: {can_print_hires}")

all_permissions = [
    ("Print", Permission.print),
    ("Modify", Permission.change),
    ("Copy", Permission.copy),
    ("Annotate", Permission.add_notes),
    ("Fill Forms", Permission.fill_forms),
    ("Accessibility", Permission.accessibility),
    ("Assemble", Permission.assemble),
    ("High-Res Print", Permission.print_high_resolution),
]

print("\nPermissions summary:")
for name, perm in all_permissions:
    status = "✓" if pdf_document.has_permission(perm) else "✗"
    print(f"  {status} {name}")
```

--------------------------------

### Manage PDF Document Metadata in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Illustrates how to read and modify standard and custom metadata for PDF documents using python-poppler. This includes author, title, dates, and user-defined fields. It also covers saving modified documents and accessing document IDs. Requires poppler version 0.46.0 or later for metadata modification.
```python
from datetime import datetime

from poppler import load_from_file

pdf_document = load_from_file("document.pdf", owner_password="owner")

# Read standard metadata properties
print(f"Title: {pdf_document.title}")
print(f"Author: {pdf_document.author}")
print(f"Creator: {pdf_document.creator}")
print(f"Producer: {pdf_document.producer}")
print(f"Subject: {pdf_document.subject}")
print(f"Keywords: {pdf_document.keywords}")
print(f"Creation Date: {pdf_document.creation_date}")
print(f"Modification Date: {pdf_document.modification_date}")

# Get all metadata as dictionary
infos = pdf_document.infos()
for key, value in infos.items():
    print(f"{key}: {value}")

# Modify metadata (requires poppler >= 0.46.0)
pdf_document.author = "Charles Brunet"
pdf_document.title = "Sample Document"
pdf_document.creation_date = datetime(2024, 1, 1, 12, 0, 0)
pdf_document.keywords = "python, pdf, poppler"

# Set custom metadata keys
pdf_document.set_info_key("CustomField", "Custom Value")
pdf_document.set_info_date("CustomDate", datetime.now())

# Save modified document
pdf_document.save("modified_document.pdf")

# Save a copy without modifications
pdf_document.save_a_copy("copy_document.pdf")

# Get PDF ID
pdf_id = pdf_document.pdf_id
print(f"Permanent ID: {pdf_id.permanent_id}")
print(f"Update ID: {pdf_id.update_id}")
```

--------------------------------

### Navigate PDF Table of Contents

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Shows how to access and traverse the table of contents (TOC) of a PDF document. The snippet retrieves the root of the TOC and recursively prints its items, including their titles and open/closed status. Child items can also be accessed directly.
```python
from poppler import load_from_file


def print_toc_item(item, level=0):
    """Recursively print TOC structure"""
    indent = "  " * level
    open_status = "[open]" if item.is_open else "[closed]"
    print(f"{indent}{item.title} {open_status}")

    # Iterate through children
    for child in item:
        print_toc_item(child, level + 1)


pdf_document = load_from_file("document.pdf")

# Get table of contents
toc = pdf_document.create_toc()

if toc:
    # Get root item
    root = toc.root

    # Print entire TOC
    print_toc_item(root)

    # Access children directly
    children = root.children()
    for child in children:
        print(f"TOC Item: {child.title}")
        print(f"Is Open: {child.is_open}")
else:
    print("Document has no table of contents")
```

--------------------------------

### Convert PDF Image to NumPy Array (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Explains how to convert a PDF page image into a NumPy array using the buffer protocol via `memoryview`. This gives direct access to the image data without copying, which enables efficient array operations. Because the array shares memory with the image, changes to the NumPy array directly affect the image data.

```python
import numpy

# Assuming 'image' is an Image object obtained from rendering a PDF page
a = numpy.array(image.memoryview(), copy=False)

print(a[0, 0, 0])
print(image.data[0])  # Value of the first byte of the image

a[0, 0, 0] = 0
print(image.data[0])  # It is now 0
```

--------------------------------

### Enable/Disable Poppler Logging (Python)

Source: https://github.com/cbrunet/python-poppler/blob/master/docs/usage.md

Controls the logging output of the Poppler library. Disable all error messages by calling `enable_logging(False)` and re-enable them by calling `enable_logging(True)`.
```python
import poppler

# disable logging
poppler.enable_logging(False)

# enable logging to stderr again
poppler.enable_logging(True)
```

--------------------------------

### Inspect PDF Fonts

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Retrieves information about the fonts used in a PDF document. Fonts can be fetched all at once or iterated page by page. The information includes font name, type, embedding status, and subset status. The snippet also maps the `FontType` enum to human-readable names.

```python
from poppler import load_from_file, FontType

# Human-readable names for FontType values
FONT_TYPE_NAMES = {
    FontType.unknown: "Unknown",
    FontType.type1: "Type 1",
    FontType.type1c: "Type 1C",
    FontType.type1c_ot: "Type 1C OpenType",
    FontType.type3: "Type 3",
    FontType.truetype: "TrueType",
    FontType.truetype_ot: "TrueType OpenType",
    FontType.cid_type0: "CID Type 0",
    FontType.cid_type0c: "CID Type 0C",
    FontType.cid_type0c_ot: "CID Type 0C OpenType",
    FontType.cid_truetype: "CID TrueType",
    FontType.cid_truetype_ot: "CID TrueType OpenType",
}

pdf_document = load_from_file("document.pdf")

# Get all fonts at once
fonts = pdf_document.fonts()
for font in fonts:
    print(f"Name: {font.name}")
    print(f"Type: {font.type}")
    print(f"Embedded: {font.is_embedded}")
    print(f"Subset: {font.is_subset}")
    print(f"File: {font.file}")

# Iterate through fonts page by page
font_iterator = pdf_document.create_font_iterator(start_page=0)
for page_num, page_fonts in font_iterator:
    print(f"\nFonts on page {page_num}:")
    for font in page_fonts:
        font_type_name = FONT_TYPE_NAMES.get(font.type, "Unknown")
        embed_status = "embedded" if font.is_embedded else "not embedded"
        subset_status = "(subset)" if font.is_subset else "(full)"
        print(f"  - {font.name} [{font_type_name}] {embed_status} {subset_status}")

# Check current page of iterator
print(f"Current page: {font_iterator.current_page}")
print(f"Has next: {font_iterator.has_next}")
```

--------------------------------

### Extract Embedded Files from PDF

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Allows extraction and
access to files embedded within PDF documents. This snippet only loads the PDF as a starting point for retrieving any attached files.

```python
from poppler import load_from_file

pdf_document = load_from_file("document.pdf")

# Further code to access and extract embedded files would go here.
```

--------------------------------

### Extract and Search Text from PDF Pages in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Covers text extraction from the entire page or from a specific area of a PDF page using python-poppler. It also retrieves detailed text box information, including bounding boxes and font details. Requires poppler version 0.63.0 or later for detailed text boxes and 0.89.0 for font information.

```python
from poppler import (
    load_from_file,
    CaseSensitivity,
    Page,
    Rectangle,
    SearchDirection,
    TextLayout,
)

pdf_document = load_from_file("document.pdf")
page = pdf_document.create_page(0)  # Get first page (0-indexed)

# Extract all text from page
full_text = page.text()
print(full_text)

# Extract text from specific rectangle area
rect = Rectangle(100.0, 100.0, 300.0, 400.0)
region_text = page.text(rect)
print(region_text)

# Extract text with layout mode
text_with_layout = page.text(layout_mode=TextLayout.physical_layout)

# Get detailed text boxes with positions (requires poppler >= 0.63.0)
text_boxes = page.text_list()
for text_box in text_boxes:
    print(f"Text: {text_box.text}")
    print(f"Bounding box: {text_box.bbox.as_tuple()}")
    print(f"Has space after: {text_box.has_space_after}")

    # Get character-level bounding boxes
    for i in range(len(text_box.text)):
        char_bbox = text_box.char_bbox(i)
        print(f"  Char '{text_box.text[i]}' at {char_bbox.as_tuple()}")

# Get font information from text boxes (requires poppler >= 0.89.0)
text_boxes = page.text_list(Page.TextListOption.text_list_include_font)
box = text_boxes[0]
if box.has_font_info:
    print(f"Font: {box.get_font_name()}")
    print(f"Size: {box.get_font_size()}")
    print(f"Writing mode: {box.get_wmode()}")
```

--------------------------------

### Extract Embedded Files from PDF Document

Source: https://context7.com/cbrunet/python-poppler/llms.txt

This Python snippet demonstrates how to check whether a PDF document contains embedded files, retrieve the list of those files, and extract their data. It iterates through each embedded file and prints its metadata: name, description, MIME type, size, checksum, dates, and validity. The extracted file data is then saved to disk. This functionality requires the `poppler` library.

```python
from poppler import load_from_file

pdf_document = load_from_file("document.pdf")

if pdf_document.has_embedded_files():
    embedded_files = pdf_document.embedded_files()

    for embedded_file in embedded_files:
        print(f"Name: {embedded_file.name}")
        print(f"Description: {embedded_file.description}")
        print(f"MIME type: {embedded_file.mime_type}")
        print(f"Size: {embedded_file.size} bytes")
        print(f"Checksum: {embedded_file.checksum}")
        print(f"Creation date: {embedded_file.creation_date}")
        print(f"Modification date: {embedded_file.modification_date}")
        print(f"Valid: {embedded_file.is_valid}")

        # Write the embedded file's bytes to disk
        file_data = embedded_file.data
        with open(f"extracted_{embedded_file.name}", "wb") as f:
            f.write(file_data)
else:
    print("No embedded files found")
```

--------------------------------

### Search Text in PDF Page

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Searches for a specific text string within a PDF page. The call takes the search term, a rectangle to limit the search area, a search direction, and a case sensitivity setting. It returns the rectangle where the text was found, or `None` if it was not found.
```python
from poppler import Rectangle, SearchDirection, CaseSensitivity

# Assuming 'page' is a Page object created with create_page()
search_rect = Rectangle(0.0, 0.0, 0.0, 0.0)
found_rect = page.search(
    "searchterm",
    search_rect,
    SearchDirection.from_top,
    CaseSensitivity.case_sensitive,
)

if found_rect:
    print(f"Found at: {found_rect.as_tuple()}")
else:
    print("Not found")
```

--------------------------------

### Suppress PDF Error Logging in Python

Source: https://context7.com/cbrunet/python-poppler/llms.txt

Demonstrates how to disable and re-enable logging of poppler errors. This is useful for preventing noisy stderr output when processing potentially problematic PDFs. Requires poppler version 0.30.0 or higher. While logging is disabled, error messages are suppressed.

```python
from poppler import load_from_file, enable_logging

# Disable error logging (suppresses stderr output)
enable_logging(False)

# Load document that might have errors
pdf_document = load_from_file("problematic.pdf")
page = pdf_document.create_page(0)
text = page.text()  # No error messages printed

# Re-enable logging
enable_logging(True)  # Now errors will be printed to stderr again
```
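The disable/re-enable pairing shown above can be wrapped in a context manager so that logging is restored even if processing raises. The following is a minimal sketch rather than part of python-poppler: `logging_disabled` is a hypothetical helper name, it takes the toggle function as an argument (for example `poppler.enable_logging`), and it assumes logging was enabled beforehand, since it always switches logging back on. A stand-in function is used here so the sketch runs without Poppler installed.

```python
from contextlib import contextmanager


@contextmanager
def logging_disabled(enable_logging):
    """Call enable_logging(False) on entry and enable_logging(True) on exit.

    Hypothetical helper: 'enable_logging' is any callable taking a bool,
    such as poppler.enable_logging. Assumes logging was on before entry.
    """
    enable_logging(False)
    try:
        yield
    finally:
        # Runs even if the body raises, so logging is not left switched off.
        enable_logging(True)


# Stand-in for poppler.enable_logging that records each call.
calls = []


def fake_enable_logging(enabled):
    calls.append(enabled)


with logging_disabled(fake_enable_logging):
    calls.append("work")  # PDF processing would happen here

print(calls)  # [False, 'work', True]
```

With python-poppler installed, the same helper would presumably be used as `with logging_disabled(poppler.enable_logging): ...` around the calls that load and read a problematic PDF.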