### Install docx2python via pip

Source: https://github.com/shayhill/docx2python/blob/master/README.md

The standard command to install the docx2python library into your Python environment.

```bash
pip install docx2python
```

--------------------------------

### Save Images from Docx

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Provides examples of how to save images extracted from a .docx file. It demonstrates both saving images with the `save_images` method and writing them manually from the `images` attribute after the DocxContent object has been created.

```python
from docx2python import docx2python

# save images to a directory using the save_images method
with docx2python('path/to/file.docx') as docx_content:
    docx_content.save_images('path/to/image/directory')
```

```python
from docx2python import docx2python

# iterate through the images of a DocxContent object and save them manually
with docx2python('path/to/file.docx') as docx_content:
    for name, image in docx_content.images.items():
        with open(name, 'wb') as image_destination:
            image_destination.write(image)
```

--------------------------------

### Saving Images from DOCX

Source: https://context7.com/shayhill/docx2python/llms.txt

Illustrates the process of extracting and saving images embedded within a .docx file. The example shows specifying an output directory when initializing the docx2python object, which automatically saves the images. It also notes how image references appear in the extracted text.

```python
from docx2python import docx2python

# Method 1: Specify image folder when opening
with docx2python('document.docx', 'output/images/') as doc:
    # Images are automatically saved to output/images/
    # Access image names in text as: ----media/image1.png----
    print(doc.text)
```

--------------------------------

### Main Function: docx2python

Source: https://context7.com/shayhill/docx2python/llms.txt

The primary entry point for extracting content from docx files.
Opens a docx file and returns a DocxContent object with all extracted content accessible through various properties.

```APIDOC
## Main Function: docx2python

### Description
The primary entry point for extracting content from docx files. Opens a docx file and returns a DocxContent object with all extracted content accessible through various properties.

### Method
Not applicable (Python function)

### Endpoint
Not applicable (Python function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
from docx2python import docx2python

# Basic extraction with context manager (recommended)
with docx2python('document.docx') as doc:
    # Get all text as a single string
    print(doc.text)

    # Get body content as nested lists [table][row][cell][paragraph]
    for table in doc.body:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    print(paragraph)

# Extract with HTML formatting enabled
with docx2python('document.docx', html=True) as doc:
    # Text formatting converted to HTML tags
    # <b>bold</b>, <i>italic</i>, <u>underline</u>
    print(doc.body[0][0][0])  # ['<b>Bold text</b>', '<i>Italic text</i>']

# Extract and save images to a directory
with docx2python('document.docx', 'output/images/') as doc:
    print(doc.text)  # Images saved to output/images/

# Handle merged cells in tables (default behavior duplicates content)
with docx2python('document.docx', duplicate_merged_cells=True) as doc:
    # Tables are always nxm with merged cell content duplicated
    table = doc.body[0]  # First table

# Without context manager - remember to close
doc = docx2python('document.docx')
print(doc.text)
doc.close()
```

### Response
#### Success Response (200)
Returns a DocxContent object.

#### Response Example
```json
{
  "DocxContent": "Object with properties like text, body, header, footer, etc."
}
```
```

--------------------------------

### Basic docx2python Extraction

Source: https://context7.com/shayhill/docx2python/llms.txt

Demonstrates the primary function for extracting content from a .docx file using a context manager. It shows how to access all text as a single string and iterate through the body content structured as nested lists representing tables, rows, cells, and paragraphs. It also illustrates enabling HTML formatting for text styles.

```python
from docx2python import docx2python

# Basic extraction with context manager (recommended)
with docx2python('document.docx') as doc:
    # Get all text as a single string
    print(doc.text)

    # Get body content as nested lists [table][row][cell][paragraph]
    for table in doc.body:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    print(paragraph)

# Extract with HTML formatting enabled
with docx2python('document.docx', html=True) as doc:
    # Text formatting converted to HTML tags
    # <b>bold</b>, <i>italic</i>, <u>underline</u>
    print(doc.body[0][0][0])  # ['<b>Bold text</b>', '<i>Italic text</i>']

# Extract and save images to a directory
with docx2python('document.docx', 'output/images/') as doc:
    print(doc.text)  # Images saved to output/images/

# Handle merged cells in tables (default behavior duplicates content)
with docx2python('document.docx', duplicate_merged_cells=True) as doc:
    # Tables are always nxm with merged cell content duplicated
    table = doc.body[0]  # First table

# Without context manager - remember to close
doc = docx2python('document.docx')
print(doc.text)
doc.close()
```

--------------------------------

### Process docx files in memory with BytesIO

Source: https://context7.com/shayhill/docx2python/llms.txt

Demonstrates how to extract text and HTML content from docx files stored in memory using BytesIO. This is useful for handling file streams without writing to the disk.
```python
from io import BytesIO

import requests
from docx2python import docx2python

# Read from bytes in memory
with open('document.docx', 'rb') as f:
    docx_bytes = BytesIO(f.read())

with docx2python(docx_bytes) as doc:
    print(doc.text)

# Download and process directly
response = requests.get('https://example.com/document.docx')
docx_bytes = BytesIO(response.content)

with docx2python(docx_bytes, html=True) as doc:
    print(doc.body)
```

--------------------------------

### Extract and Save Images from DOCX

Source: https://context7.com/shayhill/docx2python/llms.txt

Demonstrates how to extract images from a Word document either by saving them directly to a directory or by accessing the raw binary data for custom processing.

```python
from docx2python import docx2python

# Save images to directory
with docx2python('document.docx') as doc:
    doc.save_images('output/images/')

# Access image binary data
with docx2python('document.docx') as doc:
    for name, image_bytes in doc.images.items():
        with open(f'output/{name}', 'wb') as f:
            f.write(image_bytes)
```

--------------------------------

### Generate HTML Document Map

Source: https://context7.com/shayhill/docx2python/llms.txt

Create an HTML visualization of the document structure, which is useful for debugging content positions and table layouts.

```python
from docx2python import docx2python

with docx2python('document.docx') as doc:
    html_content = doc.html_map

with open('document_map.html', 'w') as f:
    f.write(html_content)
```

--------------------------------

### Accessing DocxContent Properties

Source: https://context7.com/shayhill/docx2python/llms.txt

Shows how to access various parts of the document using properties of the DocxContent object. This includes main content areas (header, footer, body, footnotes, endnotes), run-level details, Par objects with metadata, convenience properties like all text and HTML map, document metadata, and raw image data.
```python
from docx2python import docx2python

with docx2python('document.docx') as doc:
    # Main content areas (4-deep nested lists: [table][row][cell][paragraph])
    header_content = doc.header  # Header text
    footer_content = doc.footer  # Footer text
    body_content = doc.body  # Main document body
    footnotes = doc.footnotes  # Footnotes
    endnotes = doc.endnotes  # Endnotes
    full_doc = doc.document  # header + body + footer + footnotes + endnotes

    # Run-level access (5-deep nested lists: [table][row][cell][paragraph][run])
    header_runs = doc.header_runs  # Individual text runs in header
    body_runs = doc.body_runs  # Individual text runs in body
    document_runs = doc.document_runs  # All text runs

    # Par objects with full metadata (4-deep: [table][row][cell][Par])
    body_pars = doc.body_pars  # Par instances with styles and lineage
    document_pars = doc.document_pars  # All Par instances

    # Convenience properties
    all_text = doc.text  # All text joined with "\n\n"
    html_map = doc.html_map  # Visual HTML map of content with indices

    # Document metadata
    props = doc.core_properties  # {'creator': 'Author', 'lastModifiedBy': '...'}

    # Images as binary data
    images = doc.images  # {'image1.png': b'...', 'image2.jpg': b'...'}
```

--------------------------------

### Extract Docx Content with Docx2Python

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Demonstrates basic extraction of text content from a .docx file using the docx2python library. It shows both a context manager, which closes the document automatically, and manual closing of the document object.
```python
from docx2python import docx2python

# extract docx content
with docx2python('path/to/file.docx') as docx_content:
    print(docx_content.text)

docx_content = docx2python('path/to/file.docx')
print(docx_content.text)
docx_content.close()
```

--------------------------------

### Navigate Document Structure with Iterators

Source: https://context7.com/shayhill/docx2python/llms.txt

Utilize helper functions to traverse the nested document structure, enumerate positions, and filter content like empty paragraphs.

```python
from docx2python import docx2python
from docx2python.iterators import iter_at_depth, enum_cells, iter_tables

with docx2python('document.docx') as doc:
    for paragraph in iter_at_depth(doc.body, 4):
        print(paragraph)

    # Filter out empty paragraphs
    tables = doc.body
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [p for p in cell if p]
```

--------------------------------

### Paragraph Handling with Par Instances in Docx2Python

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Illustrates the use of Par instances introduced in Version 3 of docx2python. Each paragraph is returned as a Par instance, enabling the extraction of paragraph-specific properties such as heading levels.

```python
[  # document
    [
        [
            [
                Par instance,
                Par instance,
                Par instance
            ]
        ]
    ]
]
```

--------------------------------

### Access raw XML and DocxReader

Source: https://context7.com/shayhill/docx2python/llms.txt

Shows how to access the underlying XML structure of a docx file using the docx_reader. This allows for advanced manipulation of content files and saving modified documents.
```python
from docx2python import docx2python

with docx2python('document.docx') as doc:
    reader = doc.docx_reader

    # Access content files
    for file in reader.content_files():
        root = file.root_element
        # Work with lxml etree elements
        print(f"File: {file}")

    # Get specific file types
    office_doc = reader.file_of_type('officeDocument')

    # Save modified document
    # (After making changes to root_element)
    reader.save('modified.docx')
```

--------------------------------

### Expose Intermediate XML Parsing Functionality (Python)

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Docx2python v2 separates and exposes intermediate steps for navigating and extracting content from XML documents. This allows developers to iterate through the document, identify specific elements (e.g., paragraphs with particular formats), and extract their content, facilitating easier extension and custom processing.

```python
# See docx_reader.py for module details.
# See utilities.py for examples of major new features.
```

--------------------------------

### Analyze Paragraph Metadata and Styling

Source: https://context7.com/shayhill/docx2python/llms.txt

Access detailed paragraph information including styles, HTML formatting, document lineage, and list positioning using Par objects.

```python
from docx2python import docx2python

with docx2python('document.docx', html=True) as doc:
    for par in doc.document_pars[0][0][0]:
        print(f"Style: {par.style}")
        print(f"HTML style: {par.html_style}")
        print(f"Lineage: {par.lineage}")
        print(f"List position: {par.list_position}")
        for run in par.runs:
            print(f"  Run text: {run.text}")
        print(f"Formatted: {par.run_strings}")
```

--------------------------------

### Export XML and Writing Functions (Python)

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Docx2python v2 provides access to extracted XML files and the functions used to write them back into a DOCX.
This functionality enables users to perform light editing, such as search and replace, for document templating.

```python
# Example usage (conceptual), using the docx_reader attribute of DocxContent:
from docx2python import docx2python

with docx2python("my_document.docx") as docx_content:
    reader = docx_content.docx_reader
    for file in reader.content_files():
        root = file.root_element  # lxml element; edit it in place here
    # write the (possibly modified) XML back into a new docx
    reader.save("output_document.docx")
```

--------------------------------

### Extract Docx Content with Image Saving

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Shows how to extract text content from a .docx file and simultaneously save any embedded images to a specified directory. This is useful for processing documents that contain visual assets.

```python
from docx2python import docx2python

# extract docx content, write images to image_directory
with docx2python('path/to/file.docx', 'path/to/image_directory') as docx_content:
    print(docx_content.text)
```

--------------------------------

### Perform Text Replacement and Metadata Extraction

Source: https://context7.com/shayhill/docx2python/llms.txt

Use utility functions to perform search-and-replace operations on document templates and extract document-wide metadata like hyperlinks and headings.

```python
from docx2python.utilities import replace_docx_text, get_links, get_headings

replace_docx_text('template.docx', 'output.docx', ('#NAME#', 'John Doe'))

for href, text in get_links('document.docx'):
    print(f"Link: {text} -> {href}")

for heading_runs in get_headings('document.docx'):
    print(f"Heading: {''.join(heading_runs)}")
```

--------------------------------

### Handling Text Runs in Docx2Python

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Demonstrates the structure of text runs within a DOCX document as parsed by docx2python. Version 2 introduced run attributes, allowing for finer-grained text extraction, useful for identifying specific formatting like italics or hyperlinks.
```python
[  # document
    [
        [
            [
                "a text run",
                "runs break when formatting changes",
                "--",
                "runs break with bullets and special insertions",
            ]
        ]
    ]
]
```

--------------------------------

### docx2python Extraction Parameters

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Extracts content from a provided .docx file into a structured Python object. docx2python is a Python function, not an HTTP endpoint; the fields below describe its arguments and the attributes of the returned object.

```APIDOC
## docx2python Extraction Parameters

### Description
Parses a .docx file and returns a structured object containing text, images, and metadata. Supports configuration for HTML output and table cell handling.

### Method
Not applicable (Python function)

### Endpoint
Not applicable (Python function)

### Parameters
#### Function Arguments
- **docx_path** (string) - Required - Path to the .docx file, passed as the first positional argument.
- **html** (boolean) - Optional - If True, exports text with HTML formatting tags. Default: False.
- **duplicate_merged_cells** (boolean) - Optional - If True, fills merged table cells with content from adjacent cells. Default: True.

### Request Example
docx2python("/path/to/document.docx", html=True, duplicate_merged_cells=True)

### Response
#### Return Value
- **body** (list) - The extracted document content structured as nested lists.
- **images** (dict) - Extracted images mapping image names to binary data.
- **core_properties** (dict) - Document metadata such as creator and lastModifiedBy.

#### Response Example
{
  "body": [[[["Paragraph 1 text"]]]],
  "images": {"image1.jpg": "binary_data"},
  "core_properties": {"creator": "Author Name"}
}
```

--------------------------------

### Extract Docx Content with HTML Formatting

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Illustrates how to extract text content from a .docx file while converting basic font styles (like bold, italic, underline) into HTML tags. This option is useful for preserving some text formatting during extraction.
```python
from docx2python import docx2python

# extract docx content with basic font styles converted to html
with docx2python('path/to/file.docx', html=True) as docx_content:
    print(docx_content.text)
```

--------------------------------

### Merging Consecutive Runs with Identical Formatting

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Explains the functionality of merging consecutive text runs that have identical formatting, a feature introduced in Version 2 of docx2python. This process addresses issues where MS Word arbitrarily breaks text runs, which can hinder algorithmic text processing.

```xml
<w:r>
    <w:t>work to im</w:t>
</w:r>
<w:r>
    <w:t>prove docx2python</w:t>
</w:r>
```

--------------------------------

### Generating HTML Map of Document Contents with docx2python

Source: https://github.com/shayhill/docx2python/blob/master/README.md

A function that transforms DOCX table data into an HTML structure, creating a visual map of the document's content. It uses `enum_at_depth` to process paragraphs, cells, rows, and tables, wrapping them in appropriate HTML tags.

```python
from docx2python.iterators import enum_at_depth


def html_map(tables) -> str:
    """Create an HTML map of document contents.

    Render this in a browser to visually search for data.

    :tables: value could come from, e.g.,
        * docx_to_text_output.document
        * docx_to_text_output.body
    """
    # prepend index tuple to each paragraph
    for (i, j, k, l), paragraph in enum_at_depth(tables, 4):
        tables[i][j][k][l] = " ".join([str((i, j, k, l)), paragraph])

    # wrap each paragraph in <pre> tags
    for (i, j, k), cell in enum_at_depth(tables, 3):
        tables[i][j][k] = "".join(["<pre>{x}</pre>".format(x=x) for x in cell])

    # wrap each cell in <td> tags
    for (i, j), row in enum_at_depth(tables, 2):
        tables[i][j] = "".join(["<td>{x}</td>".format(x=x) for x in row])

    # wrap each row in <tr> tags
    for (i,), table in enum_at_depth(tables, 1):
        tables[i] = "".join(["<tr>{x}</tr>".format(x=x) for x in table])

    # wrap each table in <table> tags
    tables = "".join(['<table border="1">{x}</table>'.format(x=x) for x in tables])

    return "<html><body>" + tables + "</body></html>"
```

--------------------------------

### Saving Images

Source: https://context7.com/shayhill/docx2python/llms.txt

Extract and save embedded images from docx files.

```APIDOC
## Saving Images

### Description
Extract and save embedded images from docx files.

### Method
Not applicable (Python function)

### Endpoint
Not applicable (Python function)

### Parameters
None

### Request Example
```python
from docx2python import docx2python

# Method 1: Specify image folder when opening
with docx2python('document.docx', 'output/images/') as doc:
    # Images are automatically saved to output/images/
    # Access image names in text as: ----media/image1.png----
    print(doc.text)
```

### Response
#### Success Response (200)
Images are saved to the specified directory. The `doc.text` property will contain placeholders for images.

#### Response Example
```json
{
  "message": "Images saved to output/images/",
  "text_with_placeholders": "Some text ----media/image1.png---- more text."
}
```
```

--------------------------------

### Capture Paragraph Styles (HTML)

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Docx2python v2 captures and represents paragraph styles, such as 'Heading 1', in the output HTML. This contrasts with v1, which ignored these styles even when 'html=True' was specified.

```html
<h1>h1 is a paragraph style<b>bold is a run style</b></h1>
```

--------------------------------

### DocxContent Properties

Source: https://context7.com/shayhill/docx2python/llms.txt

The DocxContent object provides access to different parts of the document through various properties. Each returns nested lists representing the document structure.

```APIDOC
## DocxContent Properties

### Description
The DocxContent object provides access to different parts of the document through various properties. Each returns nested lists representing the document structure.

### Method
Not applicable (Python object properties)

### Endpoint
Not applicable (Python object properties)

### Parameters
None

### Request Example
```python
from docx2python import docx2python

with docx2python('document.docx') as doc:
    # Main content areas (4-deep nested lists: [table][row][cell][paragraph])
    header_content = doc.header  # Header text
    footer_content = doc.footer  # Footer text
    body_content = doc.body  # Main document body
    footnotes = doc.footnotes  # Footnotes
    endnotes = doc.endnotes  # Endnotes
    full_doc = doc.document  # header + body + footer + footnotes + endnotes

    # Run-level access (5-deep nested lists: [table][row][cell][paragraph][run])
    header_runs = doc.header_runs  # Individual text runs in header
    body_runs = doc.body_runs  # Individual text runs in body
    document_runs = doc.document_runs  # All text runs

    # Par objects with full metadata (4-deep: [table][row][cell][Par])
    body_pars = doc.body_pars  # Par instances with styles and lineage
    document_pars = doc.document_pars  # All Par instances

    # Convenience properties
    all_text = doc.text  # All text joined with "\n\n"
    html_map = doc.html_map  # Visual HTML map of content with indices

    # Document metadata
    props = doc.core_properties  # {'creator': 'Author', 'lastModifiedBy': '...'}

    # Images as binary data
    images = doc.images  # {'image1.png': b'...', 'image2.jpg': b'...'}
```

### Response
#### Success Response (200)
Access to various document components via properties of the DocxContent object.
#### Response Example
```json
{
  "header": "[list of header content]",
  "body": "[list of body content]",
  "text": "All extracted text content."
}
```
```

--------------------------------

### Merge Consecutive Links with Identical Hrefs (XML)

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Docx2python v2 preprocesses XML to merge consecutive hyperlink elements whose 'r:id' values point to the same URL. This resolves issues where MS Word breaks links into multiple parts, ensuring proper link representation in the output.

```xml
<w:hyperlink r:id="rId13">
    <w:r>
        <w:t>docx2py</w:t>
    </w:r>
</w:hyperlink>
<w:hyperlink r:id="rId14">
    <w:r>
        <w:t>thon</w:t>
    </w:r>
</w:hyperlink>
```

```html
<a href="https://github.com/ShayHill/docx2python">docx2py</a>
<a href="https://github.com/ShayHill/docx2python">thon</a>
```

--------------------------------

### Handle Nested Paragraphs Correctly (XML)

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Docx2python v2 correctly manages nested paragraphs within the XML structure. Unlike v1, which could omit closing HTML tags when encountering nested paragraphs, v2 ensures proper tag closure and accurate HTML representation.

```xml
<w:p>
    <w:r>
        <w:t>text</w:t>
    </w:r>
    <w:p>
        <w:r>
            <w:t>text</w:t>
        </w:r>
    </w:p>
    <w:r>
        <w:t>text</w:t>
    </w:r>
</w:p>
```

```html
<b>outer par bold text
This text is in nested par (not bold)
<b>outer par bold text</b>
```

```html
<b>outer par bold text</b>
This text is in nested par (not bold)
<b>outer par bold text</b>
```

--------------------------------

### Accessing Comments from DocxContent Object

Source: https://github.com/shayhill/docx2python/blob/master/README.md

Demonstrates how to retrieve comments embedded within a DOCX file using the `comments` attribute of the `DocxContent` object. Each comment is returned as a tuple containing reference text, author, date, and the comment text.

```python
with docx2python('path/to/file.docx') as docx_content:
    print(docx_content.comments)
```

--------------------------------

### Extracting Document Comments

Source: https://context7.com/shayhill/docx2python/llms.txt

Demonstrates how to extract comments embedded within a .docx file. The code iterates through the comments, printing the referenced text, author, date, and the comment's content.
Each comment is returned as a tuple containing these details.

```python
from docx2python import docx2python

with docx2python('document_with_comments.docx') as doc:
    # Comments are tuples of (reference_text, author, date, comment_text)
    for comment in doc.comments:
        reference_text, author, date, comment_text = comment
        print(f"Author: {author}")
        print(f"Date: {date}")
        print(f"Referenced text: {reference_text}")
        print(f"Comment: {comment_text}")
        print("---")
```

--------------------------------

### Extracting Comments

Source: https://context7.com/shayhill/docx2python/llms.txt

Extract document comments with their reference text, author, date, and comment content.

```APIDOC
## Extracting Comments

### Description
Extract document comments with their reference text, author, date, and comment content.

### Method
Not applicable (Python function)

### Endpoint
Not applicable (Python function)

### Parameters
None

### Request Example
```python
from docx2python import docx2python

with docx2python('document_with_comments.docx') as doc:
    # Comments are tuples of (reference_text, author, date, comment_text)
    for comment in doc.comments:
        reference_text, author, date, comment_text = comment
        print(f"Author: {author}")
        print(f"Date: {date}")
        print(f"Referenced text: {reference_text}")
        print(f"Comment: {comment_text}")
        print("---")
```

### Response
#### Success Response (200)
Returns a list of comments, where each comment is a tuple containing reference text, author, date, and comment content, in that order.

#### Response Example
```json
[
  [
    "This paragraph needs review",
    "John Doe",
    "2024-01-15T10:30:00Z",
    "Please clarify this section"
  ]
]
```
```

--------------------------------

### Removing Empty Paragraphs with docx2python.iterators

Source: https://github.com/shayhill/docx2python/blob/master/README.md

A utility function that iterates through tables in a DOCX document and removes any empty paragraphs within cells.
It utilizes the `enum_cells` iterator from the `docx2python.iterators` module.

```python
from docx2python.iterators import enum_cells


def remove_empty_paragraphs(tables):
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [x for x in cell if x]
```
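
To see what the empty-paragraph filter does without opening a real .docx file, the following is a minimal, dependency-free sketch. It assumes that `enum_cells` yields `((i, j, k), cell)` pairs over a `[table][row][cell][paragraph]` structure, as the snippets above suggest. `enum_cells_sketch` and the sample data are hypothetical stand-ins written for illustration, not part of docx2python.

```python
from typing import Iterator, List, Tuple

# A document part as nested lists: [table][row][cell][paragraph]
Tables = List[List[List[List[str]]]]


def enum_cells_sketch(
    tables: Tables,
) -> Iterator[Tuple[Tuple[int, int, int], List[str]]]:
    """Hypothetical stand-in for docx2python.iterators.enum_cells.

    Yields ((table_index, row_index, cell_index), cell) for every cell.
    """
    for i, table in enumerate(tables):
        for j, row in enumerate(table):
            for k, cell in enumerate(row):
                yield (i, j, k), cell


def remove_empty_paragraphs(tables: Tables) -> Tables:
    """Drop empty-string paragraphs from every cell, modifying tables in place."""
    for (i, j, k), cell in enum_cells_sketch(tables):
        tables[i][j][k] = [x for x in cell if x]
    return tables


# Hand-written sample data: one table, two rows, empty paragraphs mixed in.
sample = [[[["Title", ""], ["", ""]], [["Body text", "", "More text"]]]]
cleaned = remove_empty_paragraphs(sample)
print(cleaned)
```

Cells that contained only empty paragraphs become empty lists rather than disappearing, so the table keeps its row and column layout. With the real library, the same loop would run over `doc.body` using the imported `enum_cells`.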