### Install Mammoth
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Install the Mammoth library using pip.
```bash
pip install mammoth
```
--------------------------------
### Example: Using Result, map, and bind
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/results.md
This example demonstrates how to convert a DOCX file to HTML, access the converted HTML value, process any warning messages, and then chain transformations using `map` and `bind`.
```python
import mammoth
import mammoth.results

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

# Access converted HTML
html = result.value

# Process messages
for message in result.messages:
    if message.type == "warning":
        print(f"Warning: {message.message}")

# Chain operations with map
result2 = result.map(lambda html: html.upper())  # Transform HTML

# Or use bind for operations that return a Result
def add_footer(html):
    new_html = html + ""  # Append footer markup here
    return mammoth.results.Result(new_html, [])

result3 = result.bind(add_footer)
```
--------------------------------
### Style Map File Format Example
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Illustrates the basic format of a style map file used with the `--style-map` option. Comments start with '#', and mappings define conversions from DOCX styles to HTML elements.
```text
# Comments start with #
p[style-name='Heading 1'] => h1:fresh
p[style-name='Normal'] => p:fresh
r => span
```
--------------------------------
### Style Map File Example
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
An example of a style map file used to customize the conversion process. This file maps specific document styles to HTML elements.
```text
# styles.txt
p[style-name='Heading 1'] => h1:fresh
p[style-name='Normal'] => p:fresh
r => span
```
--------------------------------
### Example: Constructing a Table
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Demonstrates how to create a nested structure representing a table with rows and cells containing text.
```python
import mammoth.documents as documents
table = documents.table([
    documents.table_row([
        documents.table_cell([documents.paragraph([documents.text("Cell 1")])]),
        documents.table_cell([documents.paragraph([documents.text("Cell 2")])])
    ])
])
```
--------------------------------
### Custom Style Map Example
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Define custom mappings for document styles to HTML elements. Each mapping should be on a new line, with comments starting with '#'.
```python
import mammoth

style_map = """
# Comments start with #
p[style-name='Heading 1'] => h1:fresh
p[style-name='Normal'] => p:fresh
r => span
"""

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
```
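When mappings are kept as data rather than as a literal string, the one-mapping-per-line format can be assembled programmatically. The sketch below is only an illustration: `build_style_map` is a hypothetical helper, not part of Mammoth, and it assumes each mapping is a `(document matcher, HTML path)` pair.

```python
def build_style_map(mappings):
    """Join (document matcher, HTML path) pairs into style map text.

    Hypothetical helper: emits one "matcher => path" mapping per line.
    """
    return "\n".join(f"{matcher} => {path}" for matcher, path in mappings)


mappings = [
    ("p[style-name='Heading 1']", "h1:fresh"),
    ("p[style-name='Normal']", "p:fresh"),
    ("r", "span"),
]

style_map = build_style_map(mappings)
print(style_map)
```

The resulting string can be passed as the `style_map` argument in the same way as the literal string above.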
--------------------------------
### Example: Processing Conversion Messages
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/results.md
This example shows how to iterate through the messages associated with a conversion result and print their type and message text.
```python
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

for msg in result.messages:
    print(f"[{msg.type}] {msg.message}")

# Output examples:
# [warning] Unrecognised paragraph style: Custom Style (Style ID: CustomStyle1)
# [warning] Could not find image file: image.png
```
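For longer documents it may be easier to group messages by type before reporting them. The following is a minimal sketch that runs without a DOCX file: `Message` is a stand-in namedtuple and `group_by_type` is a hypothetical helper, both assuming only the `type` and `message` attributes used above.

```python
from collections import defaultdict, namedtuple

# Stand-in for the objects in result.messages (assumption: only the
# .type and .message attributes shown in this section are needed).
Message = namedtuple("Message", ["type", "message"])


def group_by_type(messages):
    """Return a dict mapping each message type to its list of message texts."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg.type].append(msg.message)
    return dict(grouped)


sample = [
    Message("warning", "Unrecognised paragraph style: Custom Style (Style ID: CustomStyle1)"),
    Message("warning", "Could not find image file: image.png"),
]

grouped = group_by_type(sample)
for message_type, texts in grouped.items():
    print(f"{message_type}: {len(texts)} message(s)")
```

With a real conversion, `group_by_type(result.messages)` would replace the sample list.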
--------------------------------
### Handle Missing Image Files
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md
This example demonstrates how to detect and report when an image file referenced in the document cannot be found. The image element is skipped, and conversion continues.
```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

for msg in result.messages:
    if "Could not find image" in msg.message:
        print(f"Missing image: {msg.message}")

# Output: Missing image: Could not find image file: images/picture1.png
```
--------------------------------
### Example of Nested Elements Output
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Illustrates the resulting HTML structure when using the '>' operator for nested elements.
```html
Heading
```
--------------------------------
### Create Paragraph with Indentation
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Example of creating a paragraph with specific indentation settings. Requires importing mammoth.documents and using the paragraph_indent helper function.
```python
import mammoth.documents as documents
indent = documents.paragraph_indent(
    start=720,  # 0.5 inch left indent (720 twips = 1/2 inch)
    end=0,  # No right indent
    first_line=None,
    hanging=None
)
para = documents.paragraph(
    [documents.text("Indented paragraph")],
    indent=indent
)
```
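Because indent values are expressed in twips, a small conversion helper can make them easier to read. This is a sketch using plain arithmetic (1 twip = 1/20 point, 72 points = 1 inch); the function names are hypothetical and not part of Mammoth.

```python
TWIPS_PER_POINT = 20
POINTS_PER_INCH = 72
TWIPS_PER_INCH = TWIPS_PER_POINT * POINTS_PER_INCH  # 1440


def inches_to_twips(inches):
    """Convert inches to the nearest whole number of twips."""
    return round(inches * TWIPS_PER_INCH)


def twips_to_points(twips):
    """Convert twips to points."""
    return twips / TWIPS_PER_POINT


half_inch = inches_to_twips(0.5)
print(half_inch)                   # 720
print(twips_to_points(half_inch))  # 36.0
```

A value such as `inches_to_twips(0.5)` could then be passed as `start` in place of the literal `720`.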
--------------------------------
### Example of Reused Elements Output
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Shows the HTML output when elements are reused and merged, contrasting with the ':fresh' modifier.
```html
Heading
Content 1
Content 2
```
--------------------------------
### Create Superscript Run
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Example of creating a text run with superscript vertical alignment. Requires importing the mammoth.documents module.
```python
import mammoth.documents
# Create superscript run
run = mammoth.documents.run(
    [mammoth.documents.text("2")],
    vertical_alignment=mammoth.documents.VerticalAlignment.superscript
)
```
--------------------------------
### Creating a Style Mapping
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Example of creating a style mapping using the 'style' function. It associates a paragraph with a specific style name to an H1 HTML element.
```python
import mammoth.styles
import mammoth.document_matchers
import mammoth.html_paths
style = mammoth.styles.style(
    document_matcher=mammoth.document_matchers.paragraph(style_name="Heading 1"),
    html_path=mammoth.html_paths.path([
        mammoth.html_paths.element(["h1"], fresh=True)
    ])
)
```
--------------------------------
### Reduce Paragraph Indentation
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
This example demonstrates how to reduce the indentation of paragraphs. The `reduce_indent` function halves the start indentation value if it exists, creating a new paragraph with the updated indentation.
```python
import mammoth

# Modify paragraph indentation
def reduce_indent(paragraph):
    if paragraph.indent and paragraph.indent.start:
        # Halve the start indent and return an updated copy of the paragraph
        new_indent = paragraph.indent.copy(start=paragraph.indent.start // 2)
        return paragraph.copy(indent=new_indent)
    return paragraph

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.paragraph(reduce_indent)
    )
```
--------------------------------
### Create Subscript Run
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Example of creating a text run with subscript vertical alignment. Requires importing the mammoth.documents module.
```python
import mammoth.documents
# Create subscript run
run = mammoth.documents.run(
    [mammoth.documents.text("H")],
    vertical_alignment=mammoth.documents.VerticalAlignment.subscript
)
```
--------------------------------
### Customizing HTML Styles with Style Maps
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md
Shows how to apply custom styles to HTML elements based on DOCX styles using a style map. This example targets 'Heading 1' paragraphs.
```python
style_map = "p[style-name='Heading 1'] => h1:fresh"
result = mammoth.convert_to_html(f, style_map=style_map)
```
--------------------------------
### Common Style Maps for Mammoth
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
These examples show how to map specific .docx styles to HTML elements for headings, code blocks, asides, and lists. Add them to a style map to customize the conversion.
```plaintext
p[style-name='Heading 1'] => h1:fresh
p[style-name='Heading 2'] => h2:fresh
p[style-name='Heading 3'] => h3:fresh
```
```plaintext
p[style-name='Code'] => pre:separator('\n')
r[style-name='Code'] => code
```
```plaintext
p[style-name='Tip Heading'] => div.tip > h3:fresh
p[style-name='Tip Text'] => div.tip > p:fresh
```
```plaintext
p:ordered-list(1) => ol > li:fresh
p:unordered-list(1) => ul > li:fresh
```
--------------------------------
### Style Map Syntax: Basic Element Matching
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Examples of matching basic document elements (paragraphs, runs, and tables) in the Mammoth style map syntax.
```plaintext
# Match elements
p # Paragraph
r # Run (formatting)
table # Table
```
--------------------------------
### Style Map Syntax: HTML Element Mapping
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Examples of mapping DOCX elements to specific HTML elements, including options for creating fresh elements, adding classes, and attributes.
```plaintext
# HTML elements
h1 # Element
h1:fresh # Fresh element (create new)
h1.classname # With class
h1[attr='val'] # With attribute
div > h1:fresh # Nested
ul|ol > li:fresh # Alternatives
pre:separator('\n') # Separator between merged
! # Ignore (don't output)
```
--------------------------------
### Handle Style Mapping Parse Errors
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md
This example shows how to identify and report errors in style mapping syntax. Lines with parsing errors are ignored, allowing conversion to continue with valid mappings.
```python
import mammoth

bad_style_map = """
p[style-name='Heading'] => h1:fresh
r[invalid syntax => span
p => p:fresh
"""

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=bad_style_map)

for msg in result.messages:
    if "Did not understand" in msg.message:
        print(msg.message)

# Output: Did not understand this style mapping, so ignored it: r[invalid syntax => span
```
--------------------------------
### Image Handler Function Example
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Custom function to handle image conversion. It receives an Image element and must return a dictionary of HTML attributes, including 'src'.
```python
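import mammoth

def my_image_handler(image):
    # Return the HTML attributes for the image element; "src" is required
    return {
        "src": "path/to/image.png",
        "alt": "Image description",
        "width": "200",
        "height": "150"
    }

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        convert_image=mammoth.images.img_element(my_image_handler)
    )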
```
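A handler does not have to point at a file on disk; it can also inline the image bytes as a data URI. The sketch below assumes the image object exposes `open()` (returning a readable binary stream) and `content_type`. `to_data_uri` and `inline_image_handler` are hypothetical names used only for illustration.

```python
import base64


def to_data_uri(content_type, raw_bytes):
    """Encode raw bytes as a base64 data URI with the given content type."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{content_type};base64,{encoded}"


def inline_image_handler(image):
    # Assumption: image.open() yields a binary stream usable as a context
    # manager, and image.content_type is a MIME type such as "image/png".
    with image.open() as image_bytes:
        return {"src": to_data_uri(image.content_type, image_bytes.read())}
```

Such a handler would be wrapped with `mammoth.images.img_element(inline_image_handler)` in the same way as `my_image_handler` above.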
--------------------------------
### Handle Unrecognised Styles
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md
This example shows how to identify and potentially resolve unrecognised paragraph, run, or table styles by printing them. Unrecognised styles are still converted using default HTML elements.
```python
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

for msg in result.messages:
    if "Unrecognised" in msg.message:
        print(msg.message)

# Output: Unrecognised paragraph style: Custom Heading (Style ID: CustomHeading1)
```
--------------------------------
### Match paragraph by style name prefix
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Match paragraphs where the style name starts with a specified prefix. Useful for grouping similar styles.
```mammoth
p[style-name^='Heading']
```
--------------------------------
### Convert DOCX to HTML with Python Mammoth
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md
Use this snippet to convert a DOCX file to HTML. Ensure the 'mammoth' library is installed and the DOCX file is accessible.
```python
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

# Use result.value and result.messages
```
--------------------------------
### Create Bookmark Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Instantiate a Bookmark element. The `name` parameter is required and should be unique.
```python
mammoth.documents.Bookmark(name)
```
--------------------------------
### Document Transforms: Get Descendants
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Utility functions to retrieve descendants of a document object. Get all descendants or filter by type, such as runs.
```python
import mammoth.documents
import mammoth.transforms

# Get all descendants, or only those of a given type
descendants = mammoth.transforms.get_descendants(document)
runs = mammoth.transforms.get_descendants_of_type(document, mammoth.documents.Run)
```
--------------------------------
### Convert DOCX to HTML with Mammoth
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md
Demonstrates basic conversion of a .docx file to HTML. Includes handling of conversion messages and custom style mapping.
```python
import mammoth
# Simple conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

html = result.value
for message in result.messages:
    print(f"Warning: {message.message}")

# With custom style map
style_map = "p[style-name='Heading 1'] => h1:fresh"
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
```
--------------------------------
### Create a Document Programmatically
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Demonstrates how to create a Document object programmatically. This is useful for constructing document structures from scratch.
```python
import mammoth.documents as documents
# Create a document programmatically
doc = documents.document(
    children=[
        documents.paragraph([documents.text("Hello, world!")])
    ]
)
```
--------------------------------
### Create Complete HTML Document
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Constructs a full HTML document by combining the output of the mammoth CLI with standard HTML boilerplate. This ensures a complete, viewable HTML file.
```bash
{
  echo '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
  mammoth document.docx
  echo '</body></html>'
} > complete.html
```
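The same wrapping can be done in Python once the HTML fragment is available as a string. This is a minimal sketch: `wrap_html` is a hypothetical helper, and the boilerplate it emits is one reasonable choice rather than anything Mammoth prescribes.

```python
import html


def wrap_html(fragment, title="Document"):
    """Wrap an HTML fragment in minimal page boilerplate."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )


page = wrap_html("<p>Hello</p>", title="Report & notes")
print(page)
```

In practice the fragment would be `result.value` from a conversion, and `page` would be written to a file with UTF-8 encoding.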
--------------------------------
### Getting Descendants of a Specific Type
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Retrieve all descendants of a given type (e.g., Run) from a Document element.
```python
# Get descendants of specific type
runs = mammoth.transforms.get_descendants_of_type(document, mammoth.documents.Run)
```
--------------------------------
### ParagraphIndent
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Represents paragraph indentation settings, including start, end, first line, and hanging indents.
````APIDOC
## Data Class: ParagraphIndent
### Description
Represents paragraph indentation settings.
### Fields
- **start** (int) - Left margin indent in twips (1/20th of a point)
- **end** (int) - Right margin indent in twips
- **first_line** (int) - First line indent in twips (positive or negative)
- **hanging** (int) - Hanging indent in twips (outdent of first line)
### Example
```python
import mammoth.documents as documents
indent = documents.paragraph_indent(
    start=720,  # 0.5 inch left indent (720 twips = 1/2 inch)
    end=0,  # No right indent
    first_line=None,
    hanging=None
)
para = documents.paragraph(
    [documents.text("Indented paragraph")],
    indent=indent
)
```
````
--------------------------------
### Create a Paragraph with Text
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Shows how to create a Paragraph object with basic text content and apply a style ID and name.
```python
import mammoth.documents as documents
# Create a paragraph with text
para = documents.paragraph([
documents.text("This is a paragraph.")
], style_id="Heading1", style_name="Heading 1")
```
--------------------------------
### Main Entry Point Functions
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md
These are the primary functions exposed by the `mammoth` module for converting DOCX files, extracting raw text, and reading or writing embedded style maps.
```APIDOC
## mammoth.convert_to_html
### Description
Converts a file-like object containing a DOCX file to HTML.
### Method
`convert_to_html(fileobj, **kwargs)`
### Parameters
- **fileobj**: A file-like object to read the DOCX content from.
- **kwargs**: Additional keyword arguments for customization.
### Response
- Returns a `Result` object containing the HTML output.
```
```APIDOC
## mammoth.convert_to_markdown
### Description
Converts a file-like object containing a DOCX file to Markdown.
### Method
`convert_to_markdown(fileobj, **kwargs)`
### Parameters
- **fileobj**: A file-like object to read the DOCX content from.
- **kwargs**: Additional keyword arguments for customization.
### Response
- Returns a `Result` object containing the Markdown output.
```
```APIDOC
## mammoth.extract_raw_text
### Description
Extracts the raw text content from a file-like object containing a DOCX file.
### Method
`extract_raw_text(fileobj)`
### Parameters
- **fileobj**: A file-like object to read the DOCX content from.
### Response
- Returns a `Result` object containing the extracted raw text.
```
```APIDOC
## mammoth.embed_style_map
### Description
Embeds a style map into a DOCX file-like object.
### Method
`embed_style_map(fileobj, style_map)`
### Parameters
- **fileobj**: A file-like object representing the DOCX file.
- **style_map**: The style map to embed.
### Response
- Returns `None`.
```
```APIDOC
## mammoth.read_embedded_style_map
### Description
Reads an embedded style map from a DOCX file-like object.
### Method
`read_embedded_style_map(fileobj)`
### Parameters
- **fileobj**: A file-like object representing the DOCX file.
### Response
- Returns the embedded style map as a string.
```
--------------------------------
### Create Formatted Text Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Demonstrates creating `Run` objects with different formatting like bold, custom fonts, highlighting, and vertical alignment (superscript).
```python
import mammoth.documents as documents
# Create a bold run
bold_run = documents.run([documents.text("Bold text")], is_bold=True)
# Create a bold, highlighted run in Arial
fancy_run = documents.run(
    [documents.text("Important")],
    font="Arial",
    is_bold=True,
    highlight="yellow"
)
# Superscript text
super_run = documents.run(
    [documents.text("2")],
    vertical_alignment=documents.VerticalAlignment.superscript
)
```
--------------------------------
### Run Class Constructor
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Represents a run of text with consistent formatting properties. Use this to construct text runs with various formatting options.
```python
mammoth.documents.Run(children, style_id=None, style_name=None, is_bold=None,
                      is_italic=None, is_underline=None, is_strikethrough=None,
                      is_all_caps=None, is_small_caps=None, vertical_alignment=None,
                      font=None, font_size=None, highlight=None)
```
--------------------------------
### Normalize Fonts in Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
Example of a transform function that normalizes specific fonts to a standard one. This function is passed to `mammoth.transforms.run`.
```python
import mammoth

def normalize_font(run):
    # Replace selected fonts with Arial; leave other runs unchanged
    if run.font and run.font.lower() in ["times new roman", "courier"]:
        return run.copy(font="Arial")
    return run

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.run(normalize_font)
    )
```
--------------------------------
### Remove Bold Formatting from Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
Example of a transform function that removes bold formatting from all runs. This function is passed to `mammoth.transforms.run`.
```python
import mammoth

def remove_bold(run):
    return run.copy(is_bold=False)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.run(remove_bold)
    )
```
--------------------------------
### Basic Document Conversion to HTML
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md
Demonstrates the fundamental process of converting a DOCX file to HTML using Mammoth. Ensure the 'document.docx' file exists in the same directory.
```python
import mammoth
with open("document.docx", "rb") as f:
    result = mammoth.convert_to_html(f)

print(result.value)
```
--------------------------------
### Handle Images with CLI
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Convert a DOCX file and save images to a specified directory. Existing files in the output directory will be overwritten.
```bash
mammoth document.docx --output-dir=output-dir
```
--------------------------------
### Apply Custom Styles with CLI
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Convert a DOCX file to HTML using a custom style map defined in a separate file.
```bash
mammoth document.docx output.html --style-map=custom-style-map
```
--------------------------------
### Mammoth Main Entry Point Functions
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md
Use these functions for direct document conversion and text extraction. They accept a file-like object and optional keyword arguments for customization.
```python
import mammoth
# Functions
mammoth.convert_to_html(fileobj, **kwargs) # → Result
mammoth.convert_to_markdown(fileobj, **kwargs) # → Result
mammoth.extract_raw_text(fileobj) # → Result
mammoth.embed_style_map(fileobj, style_map) # → None
mammoth.read_embedded_style_map(fileobj) # → str
```
--------------------------------
### Transforming Runs to Remove Bold
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Apply a custom function to each run to modify its formatting. This example removes bold formatting from runs.
```python
# Apply function to each run
def my_run_transform(run):
    if run.is_bold:
        return run.copy(is_bold=False)
    return run

transform = mammoth.transforms.run(my_run_transform)
```
--------------------------------
### Get Descendants of a Specific Type
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Retrieves all descendant elements of a specified type from a given element. This is a utility function for custom document transformations.
```python
import mammoth.documents
import mammoth.transforms

# `paragraph` is any document element, for example one passed to a transform
runs = mammoth.transforms.get_descendants_of_type(paragraph, mammoth.documents.Run)
```
--------------------------------
### Document Transforms: Paragraphs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Apply custom transformations to paragraphs during document conversion. This example modifies paragraphs with center alignment to have a 'Heading' style.
```python
import mammoth.transforms
# Transform paragraphs
def transform_para(para):
    if para.alignment == "center":
        return para.copy(style_name="Heading")
    return para

result = mammoth.convert_to_html(
    f,
    transform_document=mammoth.transforms.paragraph(transform_para)
)
```
--------------------------------
### Mammoth CLI Usage
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Basic syntax for using the mammoth command to convert .docx files. Specify input and optionally output paths.
```bash
mammoth [OPTIONS] docx-path [output-path]
```
--------------------------------
### Mark Highlighted Text in Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
Example of a transform function that adds a marker (makes text bold) to highlighted text. This function is passed to `mammoth.transforms.run`.
```python
import mammoth
import mammoth.documents

def mark_highlights(run):
    if run.highlight:
        text_nodes = mammoth.transforms.get_descendants_of_type(
            run, mammoth.documents.Text
        )
        if text_nodes:
            # Example marker: make highlighted text bold
            return run.copy(is_bold=True)
    return run

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.run(mark_highlights)
    )
```
--------------------------------
### ParagraphIndent Data Class
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Represents paragraph indentation settings, including start, end, first line, and hanging indents. Indent values are in twips.
```python
@cobble.data
class ParagraphIndent(object):
    start = cobble.field()  # Start indent (left margin)
    end = cobble.field()  # End indent (right margin)
    first_line = cobble.field()  # First line indent
    hanging = cobble.field()  # Hanging indent
```
--------------------------------
### Write HTML Output as UTF-8
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Demonstrates writing the converted HTML content to a file using UTF-8 encoding, either in text or binary mode.
```python
with open("output.html", "w", encoding="utf-8") as f:
    f.write(result.value)
```
```python
with open("output.html", "wb") as f:
    f.write(result.value.encode("utf-8"))
```
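The two approaches write the same bytes as long as text mode does not translate newlines. The following self-contained check illustrates this with a sample string standing in for `result.value`; passing `newline=""` in text mode is an addition made here so that the comparison also holds on platforms that would otherwise rewrite `\n`.

```python
import os
import tempfile

html = "<p>Café ☕</p>\n"  # sample string standing in for result.value

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text.html")
    binary_path = os.path.join(tmp, "binary.html")

    # Text mode: Python encodes the string as UTF-8 on write
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write(html)

    # Binary mode: encode explicitly before writing
    with open(binary_path, "wb") as f:
        f.write(html.encode("utf-8"))

    with open(text_path, "rb") as f:
        text_bytes = f.read()
    with open(binary_path, "rb") as f:
        binary_bytes = f.read()

same = text_bytes == binary_bytes
print(same)
```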
--------------------------------
### Mammoth CLI: Image Handling
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Convert a DOCX file to HTML and save images to a specified output directory using the Mammoth CLI.
```bash
# Convert with images to separate directory
mammoth document.docx --output-dir ./output
```
--------------------------------
### Element is_void Method
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Checks if an HTML element is a void element, meaning it does not require a closing tag. Examples include br, hr, img, and input.
```python
def is_void(self):
    """Check if element is a void element (no closing tag required)."""
```
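To show why the distinction matters when writing HTML, here is a small stand-alone sketch. It is not Mammoth's implementation: the tag set contains only the four examples named above, and `is_void_tag` and `render_element` are hypothetical names.

```python
# Assumption: only the four tags listed in the description are treated as void.
VOID_TAG_NAMES = frozenset(["br", "hr", "img", "input"])


def is_void_tag(tag_name):
    """Return True if the tag needs no closing tag."""
    return tag_name in VOID_TAG_NAMES


def render_element(tag_name, inner_html=""):
    """Render a void element as self-closing, anything else with a closing tag."""
    if is_void_tag(tag_name):
        return f"<{tag_name} />"
    return f"<{tag_name}>{inner_html}</{tag_name}>"


print(render_element("br"))
print(render_element("p", "text"))
```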
--------------------------------
### Style Map Syntax: Formatting and Highlighting
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Shows how to match common text formatting like bold, italic, underline, strikethrough, and highlights in the Mammoth style map syntax.
```plaintext
b / i / u / strike # Bold, italic, underline, strikethrough
highlight # Highlight any/specific color
```
--------------------------------
### Document Transforms: Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Apply custom transformations to runs (text segments) within the document. This example applies a 'code' style to runs with monospace fonts.
```python
import mammoth.transforms
# Transform runs
def transform_run(run):
    if run.font and "monospace" in run.font.lower():
        return run.copy(style_id="code")
    return run

result = mammoth.convert_to_html(
    f,
    transform_document=mammoth.transforms.run(transform_run)
)
```
--------------------------------
### Chaining Mammoth Operations
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Demonstrates how to chain conversion and transformation operations on Mammoth results. Use `map` for simple transformations and `bind` for operations that return another Result.
```python
import mammoth
import mammoth.results

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

# Transform result
result2 = result.map(lambda html: html.upper())

# Chain with bind
def add_header(html):
    new_html = "Document\n" + html
    return mammoth.results.Result(new_html, [])

result3 = result.bind(add_header)
```
--------------------------------
### Get All Descendants of an Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
Retrieves all child elements, grandchildren, and so on, of a given document element in depth-first order. Useful for processing all content within a specific part of the document.
```python
import mammoth
import mammoth.transforms
import mammoth.documents
def transform_document(document):
    # Get all text in the document
    all_descendants = mammoth.transforms.get_descendants(document)
    text_nodes = [d for d in all_descendants if isinstance(d, mammoth.documents.Text)]
    total_chars = sum(len(text.value) for text in text_nodes)
    print(f"Total characters: {total_chars}")
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform_document
    )
```
--------------------------------
### Style Map Syntax: Matching by Attributes and Styles
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Demonstrates how to match DOCX elements based on their style names, IDs, and list levels using Mammoth's style map syntax.
```plaintext
p[style-name='H1'] # By style name
p.StyleId # By style ID
p[style-name^='Head'] # Prefix match
p:ordered-list(1) # List level
```
--------------------------------
### Create Plain Text Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Use this to create a plain text element. Ensure the 'mammoth.documents' module is imported.
```python
import mammoth.documents as documents
text = documents.text("Hello, world!")
```
--------------------------------
### Convert DOCX and Pipe to Sed
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Demonstrates piping the output of the mammoth conversion to another command-line tool like `sed` for further processing, such as modifying HTML tags.
```bash
mammoth document.docx | sed 's//
/' > output.html
```
--------------------------------
### Mammoth CLI: Basic HTML Conversion
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Convert a DOCX file to an HTML file using the Mammoth command-line interface.
```bash
# Convert to HTML
mammoth document.docx output.html
```
--------------------------------
### Convert Center-Aligned Paragraphs to H2 Headings
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
This example demonstrates how to use the `paragraph` transform to convert center-aligned paragraphs without a specific style ID into H2 headings. It requires the `mammoth` library to be imported.
```python
import mammoth
import mammoth.transforms
# Convert center-aligned paragraphs to h2 headings
def transform_paragraph(paragraph):
    if paragraph.alignment == "center" and not paragraph.style_id:
        return paragraph.copy(style_id="Heading2", style_name="Heading 2")
    return paragraph

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.paragraph(transform_paragraph)
    )
```
--------------------------------
### Add Prefix to Specific Paragraph Style
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md
This example shows how to add a 'NOTE: ' prefix to all paragraphs with the style name 'Note'. The `add_prefix` function modifies the paragraph's children by prepending a new run containing the prefix.
```python
import mammoth
import mammoth.documents

# Add a prefix to all paragraphs with a specific style
def add_prefix(paragraph):
    if paragraph.style_name == "Note":
        run = mammoth.documents.run([mammoth.documents.text("NOTE: ")])
        return paragraph.copy(children=[run] + paragraph.children)
    return paragraph

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.paragraph(add_prefix)
    )
```
--------------------------------
### Sanitize HTML Output
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md
Sanitize HTML generated by Mammoth before embedding it in web pages to prevent Cross-Site Scripting (XSS) attacks. This example uses the 'bleach' library to clean the HTML, allowing only the listed tags.
```python
import bleach
import mammoth
with open("document.docx", "rb") as docx_file:
    html = mammoth.convert_to_html(docx_file).value
# Only the listed tags are allowed through
safe_html = bleach.clean(html, tags=['p', 'a', 'h1', 'h2', 'strong', 'em'])
```
--------------------------------
### Sanitize HTML Output with Bleach
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md
Sanitize generated HTML to remove potentially harmful content like script tags and javascript: URLs before embedding in web pages. This example shows how to use the 'bleach' library to clean HTML, allowing only specific tags.
```python
from bleach import clean
html = clean(result.value, tags=['p', 'a', 'h1', 'h2'])
```
--------------------------------
### Create a Paragraph with Multiple Runs
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Illustrates creating a Paragraph with multiple runs, including one styled as bold, to represent more complex text formatting.
```python
import mammoth.documents as documents
# Create a paragraph with multiple runs
para = documents.paragraph([
    documents.run([documents.text("Bold ")], is_bold=True),
    documents.run([documents.text("text")])
])
```
--------------------------------
### Process Multiple DOCX Files in a Loop
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Iterates through all .docx files in the current directory and converts each one to its corresponding .html file using the mammoth CLI.
```bash
for file in *.docx; do
    mammoth "$file" "${file%.docx}.html"
done
```
--------------------------------
### Match Bold Text
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Use 'b' to match text formatted as bold.
```plaintext
b
```
--------------------------------
### Mammoth CLI: Output to Standard Output
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Omit the output path to write the converted HTML to standard output, where it can be piped to another command for further processing.
```bash
# Output to stdout
mammoth document.docx | head -20
```
--------------------------------
### Using Default Image Handler (Data URIs)
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md
Explicitly use the default image handler for converting images to data URIs.
```python
# Default: data URIs
result = mammoth.convert_to_html(docx_file)
```
```python
# Explicit use of default
result = mammoth.convert_to_html(
    docx_file,
    convert_image=mammoth.images.data_uri
)
```
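For intuition about what a data URI is, the sketch below encodes raw bytes the way such a URI is laid out (content type, then base64 payload). It is a standalone illustration, not Mammoth's implementation; `to_data_uri` is a hypothetical helper name and the payload is a placeholder rather than a real image.

```python
import base64


def to_data_uri(content_type, data):
    """Encode raw bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{content_type};base64,{encoded}"


# A tiny placeholder payload stands in for real image bytes.
print(to_data_uri("image/png", b"abc"))
```

Because the whole image is embedded in the HTML, large images make the output correspondingly large; a custom image handler that writes files avoids this.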
--------------------------------
### Fail Conversion on Any Warnings
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md
Implement a strategy to halt the conversion process if any warnings are generated. This ensures that all potential issues are addressed before proceeding.
```python
result = mammoth.convert_to_html(docx_file)
if result.messages:
    raise Exception("Conversion had issues:\n" +
                    "\n".join(m.message for m in result.messages))
html = result.value
```
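If only certain message types should abort the conversion, the check can be factored into a helper. The following is a minimal sketch under stated assumptions: `Message` is a stand-in for Mammoth's message objects (which expose `type` and `message`), and `ConversionError` and `ensure_clean` are hypothetical names, not part of the library.

```python
from collections import namedtuple

# Stand-in for Mammoth's message objects, which expose `type` and `message`.
Message = namedtuple("Message", ["type", "message"])


class ConversionError(Exception):
    """Hypothetical exception raised when a conversion reports problems."""


def ensure_clean(messages, fail_on=("warning", "error")):
    """Raise ConversionError if any message has a type listed in fail_on."""
    problems = [m for m in messages if m.type in fail_on]
    if problems:
        details = "\n".join(f"[{m.type}] {m.message}" for m in problems)
        raise ConversionError(f"Conversion had issues:\n{details}")


# Passes silently: no messages were reported.
ensure_clean([])
```

In real use, `ensure_clean(result.messages)` would be called before reading `result.value`.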
--------------------------------
### Mammoth CLI: Custom Styles
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Convert a DOCX file to HTML using a custom style map defined in a separate file via the Mammoth CLI.
```bash
# Convert with custom styles
mammoth document.docx output.html --style-map styles.txt
```
--------------------------------
### Style Mapping: Word Styles to HTML
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md
Demonstrates the style mapping system that converts Word document styles into corresponding HTML elements. This process focuses on semantic structure rather than exact visual replication.
```plaintext
Word Document Style → Style Matcher             → HTML Path → HTML Output
"Heading 1" style   → p[style-name='Heading 1'] → h1:fresh  → <h1>...</h1>
"Normal" style      → p[style-name='Normal']    → p:fresh   → <p>...</p>
```
--------------------------------
### Process Multiple DOCX Files
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Iterates through a directory, converts all .docx files to HTML, and saves them to another directory. Ensure input and output directories exist.
```python
import os
import mammoth
for filename in os.listdir("docx_folder"):
    if filename.endswith(".docx"):
        with open(f"docx_folder/{filename}", "rb") as f:
            result = mammoth.convert_to_html(f)
        with open(f"html_folder/{filename[:-5]}.html", "w") as out:
            out.write(result.value)
```
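The output path can also be derived with `pathlib` instead of string slicing. This is a small sketch of that one step; `html_path_for` is a hypothetical helper name, and the folder names simply mirror the example above.

```python
from pathlib import Path


def html_path_for(docx_path, output_dir):
    """Return the .html path inside output_dir for a given .docx path."""
    return Path(output_dir) / (Path(docx_path).stem + ".html")


print(html_path_for("docx_folder/report.docx", "html_folder"))
```

Calling `Path("html_folder").mkdir(parents=True, exist_ok=True)` before the loop is one way to make sure the output directory exists.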
--------------------------------
### StringMatcher
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Matches style names using string matching operations, supporting exact and prefix matches in a case-insensitive manner.
```APIDOC
## StringMatcher
### Description
Matches style names using string matching operations.
### Fields
- **operator** (callable) - Function that performs comparison (case-insensitive)
- **value** (str) - The string pattern to match
### Factory Functions
- **equal_to(value)**: Create a matcher for exact match (case-insensitive)
- **starts_with(value)**: Create a matcher for prefix match (case-insensitive)
### Example
```python
import mammoth.document_matchers as matchers
# Match styles exactly
exact = matchers.equal_to("Heading 1")
# Match styles by prefix
prefix = matchers.starts_with("Heading")
```
```
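The matching behaviour described above can be pictured with plain predicates. This is a simplified standalone sketch of case-insensitive exact and prefix matching, not the library's code; the function names are hypothetical and chosen to avoid confusion with `equal_to` and `starts_with`.

```python
def equals_ignoring_case(value):
    """Return a predicate that matches names equal to value, ignoring case."""
    expected = value.lower()
    return lambda name: name is not None and name.lower() == expected


def starts_with_ignoring_case(value):
    """Return a predicate that matches names beginning with value, ignoring case."""
    prefix = value.lower()
    return lambda name: name is not None and name.lower().startswith(prefix)


exact = equals_ignoring_case("Heading 1")
prefix = starts_with_ignoring_case("Heading")
print(exact("heading 1"), exact("Heading 10"), prefix("HEADING 2"))
```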
--------------------------------
### Match Any Paragraph
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Use 'p' to match any paragraph element in the document.
```plaintext
p
```
--------------------------------
### Match Run by Style ID
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Match runs using their style ID for precise formatting selection.
```plaintext
r.Strong
r.Emphasis
```
--------------------------------
### RunMatcher
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Represents a pattern for matching run elements based on style ID and style name.
```APIDOC
## RunMatcher
### Description
Matches run elements based on their style ID and style name.
### Fields
- **style_id** (str) - Required - Run style ID to match.
- **style_name** (StringMatcher) - Required - Run style name pattern to match.
```
--------------------------------
### ImageWriter Class Initialization
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Python code snippet showing the initialization of the internal ImageWriter class, which is used by the CLI for handling image extraction during conversion.
```python
class ImageWriter(object):
    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1
    def __call__(self, image):
        # Saves image to output_dir and returns src path
        ...
```
--------------------------------
### Match Any Table
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Use 'table' to match any table element in the document.
```plaintext
table
```
--------------------------------
### Convert DOCX to HTML with Mammoth
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md
Use this snippet to convert a .docx file to HTML. It includes basic error handling for warnings generated during the conversion process.
```python
import mammoth
# Convert .docx to HTML
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
# Check for warnings
for message in result.messages:
    print(f"Warning: {message.message}")
```
--------------------------------
### Checking Conversion Messages
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md
Shows how to iterate through and print any messages (warnings or errors) generated during the conversion process. These messages provide insights into potential issues.
```python
for msg in result.messages:
    print(f"[{msg.type}] {msg.message}")
```
--------------------------------
### Select a fresh H1 element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Use the :fresh pseudo-class to require that the H1 element is fresh.
```mammoth
h1:fresh
```
--------------------------------
### Match All Caps Text
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Use 'all-caps' to match text formatted in all capital letters.
```plaintext
all-caps
```
--------------------------------
### Create Checkbox Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Instantiate a Checkbox element. The `checked` parameter determines its initial state.
```python
mammoth.documents.Checkbox(checked)
```
--------------------------------
### Convert DOCX to Markdown via CLI (Deprecated)
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Demonstrates the deprecated method of generating Markdown output directly from a DOCX file using the CLI.
```bash
mammoth document.docx --output-format=markdown
```
--------------------------------
### Custom Style Map for DOCX to HTML
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Applies custom style mappings to convert specific .docx styles to HTML elements. User-defined mappings take precedence over defaults. The 'fresh' keyword ensures a new element is created.
```python
import mammoth
style_map = """
p[style-name='Section Title'] => h1:fresh
p[style-name='Subsection Title'] => h2:fresh
"""
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
```
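When there are many paragraph styles to map, the style map text can be generated rather than typed by hand. The sketch below assumes every mapping has the form `p[style-name='...'] => element:fresh`, as in the example above; `build_style_map` is a hypothetical helper, and it does not handle style names that contain a single quote.

```python
def build_style_map(paragraph_styles):
    """Build style map text from {Word style name: HTML element} pairs."""
    lines = [
        f"p[style-name='{style_name}'] => {element}:fresh"
        for style_name, element in paragraph_styles.items()
    ]
    return "\n".join(lines)


style_map = build_style_map({
    "Section Title": "h1",
    "Subsection Title": "h2",
})
print(style_map)
```

The resulting string can be passed as the `style_map` argument in the same way as the hand-written one.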
--------------------------------
### Run Class
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Represents a run of text with consistent formatting properties. It can contain child elements like Text and Tab, and has properties for styling such as bold, italic, underline, font, and highlight color.
```APIDOC
## Run Class
### Description
Represents a run of text with consistent formatting properties.
### Class Definition
```python
@cobble.data
class Run(HasChildren):
    children = cobble.field()  # list of child elements (Text, Tab, etc.)
    style_id = cobble.field()
    style_name = cobble.field()
    is_bold = cobble.field()  # bool
    is_italic = cobble.field()  # bool
    is_underline = cobble.field()  # bool
    is_strikethrough = cobble.field()  # bool
    is_all_caps = cobble.field()  # bool
    is_small_caps = cobble.field()  # bool
    vertical_alignment = cobble.field()  # "baseline", "superscript", "subscript"
    font = cobble.field()  # font name string
    font_size = cobble.field()  # font size in half-points
    highlight = cobble.field()  # highlight color string
```
### Properties
| Property | Type | Description |
|----------|------|-------------|
| children | list | Text nodes, tabs, or other inline content |
| style_id | str | The run style ID |
| style_name | str | The run style name |
| is_bold | bool | Whether text is bold |
| is_italic | bool | Whether text is italic |
| is_underline | bool | Whether text is underlined |
| is_strikethrough | bool | Whether text has strikethrough |
| is_all_caps | bool | Whether text is uppercase |
| is_small_caps | bool | Whether text uses small capitals |
| vertical_alignment | str | "baseline", "superscript", or "subscript" |
| font | str | Font family name |
| font_size | int | Size in half-points (e.g., 24 = 12pt) |
| highlight | str | Highlight color (e.g., "yellow", "blue") |
### Example
```python
import mammoth.documents as documents
# Create a bold run
bold_run = documents.run([documents.text("Bold text")], is_bold=True)
# Create a bold, highlighted run in Arial
fancy_run = documents.run(
    [documents.text("Important")],
    font="Arial",
    is_bold=True,
    highlight="yellow"
)
# Superscript text
super_run = documents.run(
    [documents.text("2")],
    vertical_alignment=documents.VerticalAlignment.superscript
)
```
```
--------------------------------
### Basic DOCX to HTML Conversion
Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md
Converts a .docx file to HTML using a file-like object. Ensure the file is opened in binary mode. The result object contains the generated HTML and any conversion messages.
```python
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion
```
--------------------------------
### Mammoth Images Module
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md
Utilities for handling image conversion within Mammoth.
```APIDOC
## mammoth.images.img_element
### Description
Creates an image converter function.
### Method
`img_element(func)`
### Parameters
- **func**: A function to use for converting image elements.
```
```APIDOC
## mammoth.images.data_uri
### Description
Default image converter that generates data URIs.
### Usage
This is a default converter function.
```
--------------------------------
### Create a Table Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Constructs a Table element with specified children and optional style information. Children should be TableRow objects.
```python
mammoth.documents.Table(children, style_id=None, style_name=None)
```
--------------------------------
### Match Run by Style Name
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Match runs based on their specific style name, such as 'Strong' or 'Emphasis'.
```plaintext
r[style-name='Strong']
r[style-name='Emphasis']
```
--------------------------------
### Convert DOCX to HTML with Custom Options
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Use this snippet to convert a DOCX file object to HTML with various customization options. It shows how to specify custom styles, image handling, ID prefixes, and other conversion parameters.
```python
result = mammoth.convert_to_html(
    fileobj,
    style_map="p => p:fresh",                # Custom styles
    convert_image=mammoth.images.data_uri,   # Image handler
    id_prefix="doc_",                        # ID prefix
    ignore_empty_paragraphs=True,            # Skip empty paras
    include_embedded_style_map=True,         # Use embedded map
    include_default_style_map=True,          # Use defaults
    external_file_access=False,              # Secure
    transform_document=None,                 # Pre-conversion transform
)
```
--------------------------------
### Convert DOCX with Custom Style Map
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md
Converts a .docx file to HTML using a custom style map file to control the output's HTML structure and styling.
```bash
mammoth document.docx output.html --style-map styles.txt
```
--------------------------------
### Basic Style Mapping
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md
Map standard Word styles like Title, Subtitle, and Body to HTML elements.
```style-map
p[style-name='Title'] => h1:fresh
p[style-name='Subtitle'] => h2:fresh
p[style-name='Body'] => p:fresh
```
--------------------------------
### Handling Images During Conversion
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md
Illustrates how to process images within a DOCX document during conversion. The `save_image` callback receives each image, is responsible for saving it, and returns a dict of attributes for the resulting `<img>` element, including at least `src`.
```python
def save_image(image):
    with image.open() as img:
        # Read the image bytes, then save them to a file of your choice
        data = img.read()
    return {"src": "path/to/image.png"}
result = mammoth.convert_to_html(
    f,
    convert_image=mammoth.images.img_element(save_image)
)
```
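A fuller callback might write each image to a directory under a sequential file name. The following is a sketch under stated assumptions rather than the library's own writer: `NumberedImageSaver` is a hypothetical class, `FakeImage` is a stand-in used only so the example runs without a document, and the image object is assumed to expose `content_type` alongside `open()`.

```python
import io
import os
import shutil
import tempfile


class FakeImage:
    """Stand-in for the image passed to the callback (assumed interface)."""

    def __init__(self, content_type, data):
        self.content_type = content_type
        self._data = data

    def open(self):
        return io.BytesIO(self._data)


class NumberedImageSaver:
    """Hypothetical callback that writes images as 1.png, 2.jpeg, and so on."""

    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1

    def __call__(self, image):
        # Derive the extension from the content type, e.g. "image/png" -> "png".
        extension = image.content_type.partition("/")[2]
        filename = f"{self._image_number}.{extension}"
        with image.open() as source:
            with open(os.path.join(self._output_dir, filename), "wb") as dest:
                shutil.copyfileobj(source, dest)
        self._image_number += 1
        return {"src": filename}


output_dir = tempfile.mkdtemp()
save_image = NumberedImageSaver(output_dir)
attributes = save_image(FakeImage("image/png", b"placeholder bytes"))
print(attributes)
```

An instance of such a class could then be wrapped with `mammoth.images.img_element(...)` and passed as `convert_image`.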
--------------------------------
### Mammoth Useful Constants
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md
Lists useful constants provided by the Mammoth library for document formatting and image conversion.
```python
# Vertical alignment
mammoth.documents.VerticalAlignment.baseline
mammoth.documents.VerticalAlignment.superscript
mammoth.documents.VerticalAlignment.subscript
# Break types
mammoth.documents.line_break
mammoth.documents.page_break
mammoth.documents.column_break
# Image converters
mammoth.images.data_uri # Default: embed as data URIs
```
--------------------------------
### Create a Note Element
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md
Represents a footnote or endnote. Requires note type, a unique ID, and the body content.
```python
mammoth.documents.Note(note_type, note_id, body)
```
--------------------------------
### StringMatcher Factory Functions
Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md
Create matchers for style names using exact or prefix string comparisons. These are used in style mappings for document parsing.
```python
import mammoth.document_matchers as matchers
# Match styles exactly
exact = matchers.equal_to("Heading 1")
# Match styles by prefix
prefix = matchers.starts_with("Heading")
# Used in style mappings (via parser)
# p[style-name='Heading 1'] => h1:fresh
# p[style-name^='Heading'] => h2:fresh
```