### Install docx2python via pip
Source: https://github.com/shayhill/docx2python/blob/master/README.md
The standard command to install the docx2python library into your Python environment.
```bash
pip install docx2python
```
--------------------------------
### Save Images from Docx
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Provides examples of how to save images extracted from a .docx file. It demonstrates saving all images to a directory with the save_images method and writing them manually from the images dictionary of a DocxContent object.
```python
from docx2python import docx2python
# save images to a directory using the save_images method
with docx2python('path/to/file.docx') as docx_content:
    docx_content.save_images('path/to/image/directory')
```
```python
from docx2python import docx2python
# Save images manually by iterating over the images dictionary,
# which maps each image name to its binary content
with docx2python('path/to/file.docx') as docx_content:
    for name, image in docx_content.images.items():
        with open(name, 'wb') as image_destination:
            image_destination.write(image)
```
--------------------------------
### Saving Images from DOCX
Source: https://context7.com/shayhill/docx2python/llms.txt
Illustrates the process of extracting and saving images embedded within a .docx file. The example shows specifying an output directory when initializing the docx2python object, which automatically saves the images. It also notes how image references appear in the extracted text.
```python
from docx2python import docx2python
# Method 1: Specify image folder when opening
with docx2python('document.docx', 'output/images/') as doc:
    # Images are automatically saved to output/images/
    # Access image names in text as: ----media/image1.png----
    print(doc.text)
```
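Because image references appear inline in the extracted text, a regular expression can list them. The following is a minimal sketch that assumes the `----media/<name>----` placeholder format shown in the comment above; the helper name `find_image_placeholders` and the sample text are hypothetical and not part of docx2python.

```python
import re

# Assumed placeholder shape, based on the example "----media/image1.png----".
IMAGE_PLACEHOLDER = re.compile(r"----media/(.+?)----")


def find_image_placeholders(text: str) -> list[str]:
    """Return the image file names referenced by placeholders, in order."""
    return IMAGE_PLACEHOLDER.findall(text)


# Hypothetical extracted text containing two image placeholders.
sample_text = (
    "Intro paragraph\n\n----media/image1.png----\n\n"
    "Closing text ----media/image2.jpg----"
)
print(find_image_placeholders(sample_text))
```

The returned names could then be matched against the keys of the images dictionary, assuming the keys use the same file names.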
--------------------------------
### Main Function: docx2python
Source: https://context7.com/shayhill/docx2python/llms.txt
The primary entry point for extracting content from docx files. Opens a docx file and returns a DocxContent object with all extracted content accessible through various properties.
```APIDOC
## Main Function: docx2python
### Description
Opens a docx file and extracts its content. This is a Python function call, so it is used through an import rather than a request.
### Returns
A DocxContent object with properties such as text, body, header, and footer.
```
--------------------------------
### Basic docx2python Extraction
Source: https://context7.com/shayhill/docx2python/llms.txt
Demonstrates the primary function for extracting content from a .docx file using a context manager. It shows how to access all text as a single string and iterate through the body content structured as nested lists representing tables, rows, cells, and paragraphs. It also illustrates enabling HTML formatting for text styles.
```python
from docx2python import docx2python
# Basic extraction with context manager (recommended)
with docx2python('document.docx') as doc:
    # Get all text as a single string
    print(doc.text)
    # Get body content as nested lists [table][row][cell][paragraph]
    for table in doc.body:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    print(paragraph)
# Extract with HTML formatting enabled
with docx2python('document.docx', html=True) as doc:
    # Text formatting converted to HTML tags
    # <b>bold</b>, <i>italic</i>, <u>underline</u>
    print(doc.body[0][0][0])  # ['<b>Bold text</b>', '<i>Italic text</i>']
# Extract and save images to a directory
with docx2python('document.docx', 'output/images/') as doc:
    print(doc.text)  # Images saved to output/images/
# Handle merged cells in tables (default behavior duplicates content)
with docx2python('document.docx', duplicate_merged_cells=True) as doc:
    # Tables are always nxm with merged cell content duplicated
    table = doc.body[0]  # First table
# Without context manager - remember to close
doc = docx2python('document.docx')
print(doc.text)
doc.close()
```
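The `[table][row][cell][paragraph]` nesting also makes it possible to locate text by index. The sketch below works on a plain nested list shaped like the body content described above; the helper `find_paragraphs` and the sample data are hypothetical and only illustrate the traversal, not a docx2python function.

```python
from collections.abc import Iterator


def find_paragraphs(body: list, needle: str) -> Iterator[tuple[tuple[int, int, int, int], str]]:
    """Yield ((table, row, cell, paragraph) indices, text) for matching paragraphs."""
    for i, table in enumerate(body):
        for j, row in enumerate(table):
            for k, cell in enumerate(row):
                for l, paragraph in enumerate(cell):
                    if needle in paragraph:
                        yield (i, j, k, l), paragraph


# Hypothetical content: a one-cell "table" for a title, then a 2x2 table.
sample_body = [
    [[["Title"]]],
    [
        [["Name"], ["Total"]],
        [["Widget"], ["42", "units"]],
    ],
]
hits = list(find_paragraphs(sample_body, "42"))
print(hits)
```

Each hit carries the four indices needed to read the same paragraph back, for example `sample_body[1][1][1][0]`.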
--------------------------------
### Process docx files in memory with BytesIO
Source: https://context7.com/shayhill/docx2python/llms.txt
Demonstrates how to extract text and HTML content from docx files stored in memory using BytesIO. This is useful for handling file streams without writing to the disk.
```python
from io import BytesIO
from docx2python import docx2python
import requests
# Read from bytes in memory
with open('document.docx', 'rb') as f:
    docx_bytes = BytesIO(f.read())
with docx2python(docx_bytes) as doc:
    print(doc.text)
# Download and process directly
response = requests.get('https://example.com/document.docx')
docx_bytes = BytesIO(response.content)
with docx2python(docx_bytes, html=True) as doc:
    print(doc.body)
```
--------------------------------
### Extract and Save Images from DOCX
Source: https://context7.com/shayhill/docx2python/llms.txt
Demonstrates how to extract images from a Word document either by saving them directly to a directory or by accessing the raw binary data for custom processing.
```python
from docx2python import docx2python
# Save images to directory
with docx2python('document.docx') as doc:
    doc.save_images('output/images/')
# Access image binary data
with docx2python('document.docx') as doc:
    for name, image_bytes in doc.images.items():
        with open(f'output/{name}', 'wb') as f:
            f.write(image_bytes)
```
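Writing to `output/{name}` with `open` fails if the `output` directory does not exist. A minimal sketch of a more defensive variant follows; it takes any dictionary of image names to bytes, such as the images dictionary described above. The helper name `save_image_dict` and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path


def save_image_dict(images: dict[str, bytes], out_dir) -> list[Path]:
    """Write each image into out_dir and return the paths that were written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    written = []
    for name, data in images.items():
        # Keep only the final path component so a name cannot escape out_dir.
        path = target / Path(name).name
        path.write_bytes(data)
        written.append(path)
    return written


# Usage with hypothetical image data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_images = {"image1.png": b"\x89PNG", "media/image2.jpg": b"\xff\xd8"}
    paths = save_image_dict(sample_images, Path(tmp) / "images")
    saved_names = [p.name for p in paths]
    first_bytes = paths[0].read_bytes()
print(saved_names)
```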
--------------------------------
### Generate HTML Document Map
Source: https://context7.com/shayhill/docx2python/llms.txt
Create an HTML visualization of the document structure, which is useful for debugging content positions and table layouts.
```python
from docx2python import docx2python
with docx2python('document.docx') as doc:
    html_content = doc.html_map
with open('document_map.html', 'w') as f:
    f.write(html_content)
```
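To make the idea of a document map concrete, the sketch below renders a plain `[table][row][cell][paragraph]` list as HTML tables, prefixing every paragraph with its index tuple. It is an illustration under stated assumptions, not the library's implementation of `html_map`; the function name `nested_lists_to_html` and the markup choices are hypothetical.

```python
from html import escape


def nested_lists_to_html(tables: list) -> str:
    """Render [table][row][cell][paragraph] lists as indexed HTML tables."""
    parts = ["<html><body>"]
    for i, table in enumerate(tables):
        parts.append('<table border="1">')
        for j, row in enumerate(table):
            parts.append("<tr>")
            for k, cell in enumerate(row):
                # Prefix each paragraph with its (table, row, cell, paragraph) index.
                pars = "".join(
                    "<pre>" + escape(f"({i}, {j}, {k}, {l}) {p}") + "</pre>"
                    for l, p in enumerate(cell)
                )
                parts.append(f"<td>{pars}</td>")
            parts.append("</tr>")
        parts.append("</table>")
    parts.append("</body></html>")
    return "".join(parts)


# Hypothetical single-paragraph content with a character that needs escaping.
sample_map = nested_lists_to_html([[[["a<b"]]]])
print(sample_map)
```

Escaping the paragraph text keeps stray `<` or `&` characters from breaking the rendered page.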
--------------------------------
### Accessing DocxContent Properties
Source: https://context7.com/shayhill/docx2python/llms.txt
Shows how to access various parts of the document using properties of the DocxContent object. This includes main content areas (header, footer, body, footnotes, endnotes), run-level details, Par objects with metadata, convenience properties like all text and HTML map, document metadata, and raw image data.
```python
from docx2python import docx2python
with docx2python('document.docx') as doc:
    # Main content areas (4-deep nested lists: [table][row][cell][paragraph])
    header_content = doc.header  # Header text
    footer_content = doc.footer  # Footer text
    body_content = doc.body  # Main document body
    footnotes = doc.footnotes  # Footnotes
    endnotes = doc.endnotes  # Endnotes
    full_doc = doc.document  # header + body + footer + footnotes + endnotes
    # Run-level access (5-deep nested lists: [table][row][cell][paragraph][run])
    header_runs = doc.header_runs  # Individual text runs in header
    body_runs = doc.body_runs  # Individual text runs in body
    document_runs = doc.document_runs  # All text runs
    # Par objects with full metadata (4-deep: [table][row][cell][Par])
    body_pars = doc.body_pars  # Par instances with styles and lineage
    document_pars = doc.document_pars  # All Par instances
    # Convenience properties
    all_text = doc.text  # All text joined with "\n\n"
    html_map = doc.html_map  # Visual HTML map of content with indices
    # Document metadata
    props = doc.core_properties  # {'creator': 'Author', 'lastModifiedBy': '...'}
    # Images as binary data
    images = doc.images  # {'image1.png': b'...', 'image2.jpg': b'...'}
```
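The relationship between a 4-deep content list and a single text string can be sketched in plain Python. The example assumes, per the comment above, that paragraphs are joined with `"\n\n"`; the helper `join_paragraphs` is hypothetical, and the library's own text property may treat empty paragraphs or ordering differently.

```python
def join_paragraphs(content: list, separator: str = "\n\n") -> str:
    """Flatten [table][row][cell][paragraph] lists and join the paragraphs."""
    paragraphs = [
        paragraph
        for table in content
        for row in table
        for cell in row
        for paragraph in cell
    ]
    return separator.join(paragraphs)


# Hypothetical content: one table with two rows.
sample_content = [[[["A", "B"]], [["C"]]]]
joined = join_paragraphs(sample_content)
print(repr(joined))
```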
--------------------------------
### Extract Docx Content with Docx2Python
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Demonstrates basic extraction of text content from a .docx file using the docx2python library. It shows two ways to release the file: a context manager, which closes the document automatically, and an explicit call to close().
```python
from docx2python import docx2python
# extract docx content
with docx2python('path/to/file.docx') as docx_content:
    print(docx_content.text)
# without a context manager, close the document explicitly
docx_content = docx2python('path/to/file.docx')
print(docx_content.text)
docx_content.close()
```
--------------------------------
### Navigate Document Structure with Iterators
Source: https://context7.com/shayhill/docx2python/llms.txt
Utilize helper functions to traverse the nested document structure, enumerate positions, and filter content like empty paragraphs.
```python
from docx2python import docx2python
from docx2python.iterators import iter_at_depth, enum_cells
with docx2python('document.docx') as doc:
    # Iterate over every paragraph (depth 4) in the body
    for paragraph in iter_at_depth(doc.body, 4):
        print(paragraph)
    # Filter out empty paragraphs
    tables = doc.body
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [p for p in cell if p]
```
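For readers who want to see what this kind of traversal does without the library, here is a stdlib-only sketch on plain nested lists. The helpers `enumerate_cells` and `drop_empty_paragraphs` are hypothetical stand-ins written for illustration; they are not the functions in `docx2python.iterators`.

```python
def enumerate_cells(tables: list):
    """Yield ((table, row, cell) indices, cell) for every cell."""
    for i, table in enumerate(tables):
        for j, row in enumerate(table):
            for k, cell in enumerate(row):
                yield (i, j, k), cell


def drop_empty_paragraphs(tables: list) -> list:
    """Return a copy of tables with empty paragraphs removed from each cell."""
    return [
        [[[p for p in cell if p] for cell in row] for row in table]
        for table in tables
    ]


# Hypothetical content: one table, one row, two cells.
sample_tables = [[[["", "a", ""], [""]]]]
cell_positions = [position for position, _ in enumerate_cells(sample_tables)]
cleaned = drop_empty_paragraphs(sample_tables)
print(cell_positions, cleaned)
```

Returning a copy, rather than assigning into the list in place, leaves the original extraction untouched.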
--------------------------------
### Paragraph Handling with Par Instances in Docx2Python
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Illustrates the use of Par instances introduced in Version 3 of docx2python. Each paragraph is returned as a Par instance, enabling the extraction of paragraph-specific properties such as heading levels.
```python
[  # document
    [
        [
            [
                Par instance,
                Par instance,
                Par instance
            ]
        ]
    ]
]
```
--------------------------------
### Access raw XML and DocxReader
Source: https://context7.com/shayhill/docx2python/llms.txt
Shows how to access the underlying XML structure of a docx file using the docx_reader. This allows for advanced manipulation of content files and saving modified documents.
```python
from docx2python import docx2python
with docx2python('document.docx') as doc:
    reader = doc.docx_reader
    # Access content files
    for file in reader.content_files():
        root = file.root_element
        # Work with lxml etree elements
        print(f"File: {file}")
    # Get specific file types
    office_doc = reader.file_of_type('officeDocument')
    # Save modified document
    # (After making changes to root_element)
    reader.save('modified.docx')
```
--------------------------------
### Expose Intermediate XML Parsing Functionality (Python)
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Docx2python v2 separates and exposes intermediate steps for navigating and extracting content from XML documents. This allows developers to iterate through the document, identify specific elements (e.g., paragraphs with particular formats), and extract their content, facilitating easier extension and custom processing.
```python
# See docx_reader.py for module details.
# See utilities.py for examples of major new features.
```
--------------------------------
### Analyze Paragraph Metadata and Styling
Source: https://context7.com/shayhill/docx2python/llms.txt
Access detailed paragraph information including styles, HTML formatting, document lineage, and list positioning using Par objects.
```python
from docx2python import docx2python
with docx2python('document.docx', html=True) as doc:
    for par in doc.document_pars[0][0][0]:
        print(f"Style: {par.style}")
        print(f"HTML style: {par.html_style}")
        print(f"Lineage: {par.lineage}")
        print(f"List position: {par.list_position}")
        for run in par.runs:
            print(f"  Run text: {run.text}")
        print(f"Formatted: {par.run_strings}")
```
--------------------------------
### Export XML and Writing Functions (Python)
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Docx2python v2 provides access to extracted XML files and the functions used to write them back into a DOCX. This functionality enables users to perform light editing, such as search and replace, for document templating.
```python
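from docx2python import docx2python
# Light editing through docx_reader: open, modify the XML, then save a new file
with docx2python('my_document.docx') as docx_content:
    reader = docx_content.docx_reader
    for file in reader.content_files():
        root = file.root_element  # lxml element; edit it in place here
    reader.save('output_document.docx')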
```
--------------------------------
### Extract Docx Content with Image Saving
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Shows how to extract text content from a .docx file and simultaneously save any embedded images to a specified directory. This is useful for processing documents that contain visual assets.
```python
from docx2python import docx2python
# extract docx content, write images to image_directory
with docx2python('path/to/file.docx', 'path/to/image_directory') as docx_content:
    print(docx_content.text)
```
--------------------------------
### Perform Text Replacement and Metadata Extraction
Source: https://context7.com/shayhill/docx2python/llms.txt
Use utility functions to perform search-and-replace operations on document templates and extract document-wide metadata like hyperlinks and headings.
```python
from docx2python.utilities import replace_docx_text, get_links, get_headings
# Replace a placeholder in a template and write the result to a new file
replace_docx_text('template.docx', 'output.docx', ('#NAME#', 'John Doe'))
for href, text in get_links('document.docx'):
    print(f"Link: {text} -> {href}")
for heading_runs in get_headings('document.docx'):
    print(f"Heading: {''.join(heading_runs)}")
```
--------------------------------
### Handling Text Runs in Docx2Python
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Demonstrates the structure of text runs within a DOCX document as parsed by docx2python. Version 2 introduced run attributes, allowing for finer-grained text extraction, useful for identifying specific formatting like italics or hyperlinks.
```python
[  # document
    [
        [
            [
                "a text run", "runs break when formatting changes",
                "--", "runs break with bullets and special insertions",
            ]
        ]
    ]
]
```
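When run-level detail is not needed, runs can be collapsed back into paragraph strings. The sketch below assumes a 5-deep `[table][row][cell][paragraph][run]` list and assumes that a paragraph's text is the plain concatenation of its runs; the helper `runs_to_paragraphs` and the sample data are hypothetical.

```python
def runs_to_paragraphs(runs: list) -> list:
    """Collapse [table][row][cell][paragraph][run] into [table][row][cell][paragraph]."""
    return [
        [[["".join(paragraph) for paragraph in cell] for cell in row] for row in table]
        for table in runs
    ]


# Hypothetical content: one cell holding two paragraphs split into runs.
sample_runs = [[[[["a ", "text ", "run"], ["--", "bullet"]]]]]
paragraphs = runs_to_paragraphs(sample_runs)
print(paragraphs)
```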
--------------------------------
### docx2python Function Arguments and Return Value
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Describes the arguments of the docx2python function and the main attributes of the DocxContent object it returns.
```APIDOC
## docx2python
### Description
Parses a .docx file and returns a structured object containing text, images, and metadata. Keyword arguments control HTML output and table cell handling.
### Arguments
- **docx path** (string) - Required - Path to the .docx file, passed as the first positional argument.
- **html** (boolean) - Optional - If True, exports text with HTML formatting tags. Default: False.
- **duplicate_merged_cells** (boolean) - Optional - If True, the content of a merged table cell is duplicated into each cell it spans. Default: True.
### Call Example
docx2python("/path/to/document.docx", html=True, duplicate_merged_cells=True)
### Return Value
- **body** (list) - The extracted document content structured as nested lists: [table][row][cell][paragraph].
- **images** (dict) - Extracted images, mapping image file names to binary data.
- **core_properties** (dict) - Document metadata such as creator and lastModifiedBy.
### Return Value Example
body: [[[["Paragraph 1 text"]]]]
images: {"image1.jpg": b"..."}
core_properties: {"creator": "Author Name"}
```
--------------------------------
### Extract Docx Content with HTML Formatting
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Illustrates how to extract text content from a .docx file while converting basic font styles (like bold, italic, underline) into HTML tags. This option is useful for preserving some text formatting during extraction.
```python
from docx2python import docx2python
# extract docx content with basic font styles converted to html
with docx2python('path/to/file.docx', html=True) as docx_content:
    print(docx_content.text)
```
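Text extracted with HTML formatting sometimes needs to be reduced to plain text again, for example before searching it. The following sketch uses the standard library's `html.parser` for that; the helper `strip_tags` and the sample string are hypothetical, and the exact tags the library emits are not assumed beyond ordinary HTML markup.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text content and ignore all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return markup with HTML tags removed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Hypothetical paragraph as it might look with html=True.
plain = strip_tags("<b>Bold</b> and <i>italic</i>")
print(plain)
```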
--------------------------------
### Merging Consecutive Runs with Identical Formatting
Source: https://github.com/shayhill/docx2python/blob/master/README.md
Explains the functionality of merging consecutive text runs that have identical formatting, a feature introduced in Version 2 of docx2python. This process addresses issues where MS Word arbitrarily breaks text runs, which can hinder algorithmic text processing.
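The merging idea can be sketched independently of the library. In this minimal example each run is represented as a hypothetical `(style, text)` tuple, and consecutive runs with the same style are combined with `itertools.groupby`; this representation and the helper `merge_runs` are assumptions for illustration, not docx2python's internal data model.

```python
from itertools import groupby


def merge_runs(runs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive (style, text) runs that share the same style."""
    return [
        (style, "".join(text for _, text in group))
        for style, group in groupby(runs, key=lambda run: run[0])
    ]


# Hypothetical runs: a bold word split in two, plain text split in two, bold again.
sample = [("b", "Hel"), ("b", "lo"), ("", " wor"), ("", "ld"), ("b", "!")]
merged = merge_runs(sample)
print(merged)
```

Only adjacent runs are merged, so the final bold run stays separate from the first one.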
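```python
from docx2python.iterators import enum_at_depth
# `tables` is a nested list of extracted content, such as docx_content.body
# wrap each paragraph in <pre> tags
for (i, j, k), cell in enum_at_depth(tables, 3):
    tables[i][j][k] = "".join(["<pre>{x}</pre>".format(x=x) for x in cell])
# wrap each cell in <td> tags
for (i, j), row in enum_at_depth(tables, 2):
    tables[i][j] = "".join(["<td>{x}</td>".format(x=x) for x in row])
# wrap each row in <tr> tags
for (i,), table in enum_at_depth(tables, 1):
    tables[i] = "".join(["<tr>{x}</tr>".format(x=x) for x in table])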
# wrap each table in