### Install Mammoth Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Install the Mammoth library using pip. ```bash pip install mammoth ``` -------------------------------- ### Example: Using Result, map, and bind Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/results.md This example demonstrates how to convert a DOCX file to HTML, access the converted HTML value, process any warning messages, and then chain transformations using `map` and `bind`. ```python import mammoth with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file) # Access converted HTML html = result.value # Process messages for message in result.messages: if message.type == "warning": print(f"Warning: {message.message}") # Chain operations with map result2 = result.map(lambda html: html.upper()) # Transform HTML # Or use with bind for operations returning Results def add_footer(html): new_html = html + "" return mammoth.results.Result(new_html, []) result3 = result.bind(add_footer) ``` -------------------------------- ### Style Map File Format Example Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md Illustrates the basic format of a style map file used with the `--style-map` option. Comments start with '#', and mappings define conversions from DOCX styles to HTML elements. ```text # Comments start with # p[style-name='Heading 1'] => h1:fresh p[style-name='Normal'] => p:fresh r => span ``` -------------------------------- ### Style Map File Example Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md An example of a style map file used to customize the conversion process. This file maps specific document styles to HTML elements. 
```text # styles.txt p[style-name='Heading 1'] => h1:fresh p[style-name='Normal'] => p:fresh r => span ``` -------------------------------- ### Example: Constructing a Table Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md Demonstrates how to create a nested structure representing a table with rows and cells containing text. ```python import mammoth.documents as documents table = documents.table([ documents.table_row([ documents.table_cell([documents.paragraph([documents.text("Cell 1")])]), documents.table_cell([documents.paragraph([documents.text("Cell 2")])]) ]) ]) ``` -------------------------------- ### Custom Style Map Example Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md Define custom mappings for document styles to HTML elements. Each mapping should be on a new line, with comments starting with '#'. ```python style_map = """ # Comments start with # p[style-name='Heading 1'] => h1:fresh p[style-name='Normal'] => p:fresh r => span """ result = mammoth.convert_to_html(docx_file, style_map=style_map) ``` -------------------------------- ### Example: Processing Conversion Messages Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/results.md This example shows how to iterate through the messages associated with a conversion result and print their type and message text. 
```python import mammoth with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file) for msg in result.messages: print(f"[{msg.type}] {msg.message}") # Output examples: # [warning] Unrecognised paragraph style: Custom Style (Style ID: CustomStyle1) # [warning] Could not find image file: image.png ``` -------------------------------- ### Handle Missing Image Files Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md This example demonstrates how to detect and report when an image file referenced in the document cannot be found. The image element is skipped, and conversion continues. ```python with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file) for msg in result.messages: if "Could not find image" in msg.message: print(f"Missing image: {msg.message}") # Output: Could not find image file: images/picture1.png ``` -------------------------------- ### Example of Nested Elements Output Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md Illustrates the resulting HTML structure when using the '>' operator for nested elements. ```html

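<!-- Illustrative output for mappings such as div.tip > h3:fresh and div.tip > p:fresh -->
<div class="tip">
  <h3>Heading</h3>
  <p>Content</p>
</div>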

``` -------------------------------- ### Create Paragraph with Indentation Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Example of creating a paragraph with specific indentation settings. Requires importing mammoth.documents and using the paragraph_indent helper function. ```python import mammoth.documents as documents indent = documents.paragraph_indent( start=720, # 0.5 inch left indent (720 twips = 1/2 inch) end=0, # No right indent first_line=None, hanging=None ) para = documents.paragraph( [documents.text("Indented paragraph")], indent=indent ) ``` -------------------------------- ### Example of Reused Elements Output Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md Shows the HTML output when elements are reused and merged, contrasting with the ':fresh' modifier. ```html

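<!-- Illustrative output: the outer div.tip is reused, so consecutive paragraphs share one wrapper -->
<div class="tip">
  <h3>Heading</h3>
  <p>Content 1</p>
  <p>Content 2</p>
</div>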

``` -------------------------------- ### Create Superscript Run Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Example of creating a text run with superscript vertical alignment. Requires importing the mammoth.documents module. ```python import mammoth.documents # Create superscript run run = mammoth.documents.run( [mammoth.documents.text("2")], vertical_alignment=mammoth.documents.VerticalAlignment.superscript ) ``` -------------------------------- ### Creating a Style Mapping Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Example of creating a style mapping using the 'style' function. It associates a paragraph with a specific style name to an H1 HTML element. ```python import mammoth.styles import mammoth.document_matchers import mammoth.html_paths style = mammoth.styles.style( document_matcher=mammoth.document_matchers.paragraph(style_name="Heading 1"), html_path=mammoth.html_paths.path([ mammoth.html_paths.element(["h1"], fresh=True) ]) ) ``` -------------------------------- ### Reduce Paragraph Indentation Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md This example demonstrates how to reduce the indentation of paragraphs. The `reduce_indent` function halves the start indentation value if it exists, creating a new paragraph with the updated indentation. 
```python # Modify paragraph indentation def reduce_indent(paragraph): if paragraph.indent and paragraph.indent.start: new_indent = paragraph.indent.copy(start=paragraph.indent.start // 2) return paragraph.copy(indent=new_indent) return paragraph with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html( docx_file, transform_document=mammoth.transforms.paragraph(reduce_indent) ) ``` -------------------------------- ### Create Subscript Run Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Example of creating a text run with subscript vertical alignment. Requires importing the mammoth.documents module. ```python import mammoth.documents # Create subscript run run = mammoth.documents.run( [mammoth.documents.text("H")], vertical_alignment=mammoth.documents.VerticalAlignment.subscript ) ``` -------------------------------- ### Customizing HTML Styles with Style Maps Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md Shows how to apply custom styles to HTML elements based on DOCX styles using a style map. This example targets 'Heading 1' paragraphs. ```python style_map = "p[style-name='Heading 1'] => h1:fresh" result = mammoth.convert_to_html(f, style_map=style_map) ``` -------------------------------- ### Common Style Maps for Mammoth Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md These examples show how to map specific .docx styles to HTML elements for headings, code blocks, and asides. They are used to customize the conversion process. 
```plaintext p[style-name='Heading 1'] => h1:fresh p[style-name='Heading 2'] => h2:fresh p[style-name='Heading 3'] => h3:fresh ``` ```plaintext p[style-name='Code'] => pre:separator('\n') r[style-name='Code'] => code ``` ```plaintext p[style-name='Tip Heading'] => div.tip > h3:fresh p[style-name='Tip Text'] => div.tip > p:fresh ``` ```plaintext p:ordered-list(1) => ol > li:fresh p:unordered-list(1) => ul > li:fresh ``` -------------------------------- ### Style Map Syntax: Basic Element Matching Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Examples of matching basic HTML elements and DOCX elements like paragraphs and runs in the Mammoth style map syntax. ```plaintext # Match elements p # Paragraph r # Run (formatting) table # Table ``` -------------------------------- ### Style Map Syntax: HTML Element Mapping Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Examples of mapping DOCX elements to specific HTML elements, including options for creating fresh elements, adding classes, and attributes. ```plaintext # HTML elements h1 # Element h1:fresh # Fresh element (create new) h1.classname # With class h1[attr='val'] # With attribute div > h1:fresh # Nested ul|ol > li:fresh # Alternatives pre:separator('\n') # Separator between merged ! # Ignore (don't output) ``` -------------------------------- ### Handle Style Mapping Parse Errors Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md This example shows how to identify and report errors in style mapping syntax. Lines with parsing errors are ignored, allowing conversion to continue with valid mappings. 
```python bad_style_map = """ p[style-name='Heading'] => h1:fresh r[invalid syntax => span p => p:fresh """ with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file, style_map=bad_style_map) for msg in result.messages: if "Did not understand" in msg.message: print(msg.message) # Output: Did not understand this style mapping, so ignored it: r[invalid syntax => span ``` -------------------------------- ### Image Handler Function Example Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md Custom function to handle image conversion. It receives an Image element and must return a dictionary of HTML attributes, including 'src'. ```python def my_image_handler(image): return { "src": "path/to/image.png", "alt": "Image description", "width": "200", "height": "150" } result = mammoth.convert_to_html( docx_file, convert_image=mammoth.images.img_element(my_image_handler) ) ``` -------------------------------- ### Handle Unrecognised Styles Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md This example shows how to identify and potentially resolve unrecognised paragraph, run, or table styles by printing them. Unrecognised styles are still converted using default HTML elements. ```python import mammoth with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file) for msg in result.messages: if "Unrecognised" in msg.message: print(msg.message) # Output: Unrecognised paragraph style: Custom Heading (Style ID: CustomHeading1) ``` -------------------------------- ### Match paragraph by style name prefix Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md Match paragraphs where the style name starts with a specified prefix. Useful for grouping similar styles. 
```mammoth
p[style-name^='Heading']
```

--------------------------------

### Convert DOCX to HTML with Python Mammoth

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md

Use this snippet to convert a DOCX file to HTML. Ensure the 'mammoth' library is installed and the DOCX file is accessible.

```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    # Use result.value and result.messages
```

--------------------------------

### Create Bookmark Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Instantiate a Bookmark element. The `name` parameter is required and should be unique.

```python
mammoth.documents.Bookmark(name)
```

--------------------------------

### Document Transforms: Get Descendants

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Utility functions to retrieve descendants of a document object. Get all descendants or filter by type, such as runs.

```python
import mammoth.documents
import mammoth.transforms

# Get all descendants of a document element
descendants = mammoth.transforms.get_descendants(document)

# Get only the descendants of a given type (here, runs)
runs = mammoth.transforms.get_descendants_of_type(document, mammoth.documents.Run)
```

--------------------------------

### Convert DOCX to HTML with Mammoth

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md

Demonstrates basic conversion of a .docx file to HTML. Includes handling of conversion messages and custom style mapping.
```python import mammoth # Simple conversion with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file) html = result.value for message in result.messages: print(f"Warning: {message.message}") # With custom style map style_map = "p[style-name='Heading 1'] => h1:fresh" with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html(docx_file, style_map=style_map) ``` -------------------------------- ### Create a Document Programmatically Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md Demonstrates how to create a Document object programmatically. This is useful for constructing document structures from scratch. ```python import mammoth.documents as documents # Create a document programmatically doc = documents.document( children=[ documents.paragraph([documents.text("Hello, world!")]) ] ) ``` -------------------------------- ### Create Complete HTML Document Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md Constructs a full HTML document by combining the output of the mammoth CLI with standard HTML boilerplate. This ensures a complete, viewable HTML file. ```bash { echo '' mammoth document.docx echo '' } > complete.html ``` -------------------------------- ### Getting Descendants of a Specific Type Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md Retrieve all descendants of a given type (e.g., Run) from a Document element. ```python # Get descendants of specific type runs = mammoth.transforms.get_descendants_of_type(document, mammoth.documents.Run) ``` -------------------------------- ### ParagraphIndent Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Represents paragraph indentation settings, including start, end, first line, and hanging indents. 
```APIDOC ## Data Class: ParagraphIndent ### Description Represents paragraph indentation settings. ### Fields - **start** (int) - Left margin indent in twips (1/20th of a point) - **end** (int) - Right margin indent in twips - **first_line** (int) - First line indent in twips (positive or negative) - **hanging** (int) - Hanging indent in twips (outdent of first line) ### Example ```python import mammoth.documents as documents indent = documents.paragraph_indent( start=720, # 0.5 inch left indent (720 twips = 1/2 inch) end=0, # No right indent first_line=None, hanging=None ) para = documents.paragraph( [documents.text("Indented paragraph")], indent=indent ) ``` ``` -------------------------------- ### Create a Paragraph with Text Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md Shows how to create a Paragraph object with basic text content and apply a style ID and name. ```python import mammoth.documents as documents # Create a paragraph with text para = documents.paragraph([ documents.text("This is a paragraph.") ], style_id="Heading1", style_name="Heading 1") ``` -------------------------------- ### Main Entry Point Functions Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md These are the primary functions exposed by the `mammoth` module for converting document objects. ```APIDOC ## mammoth.convert_to_html ### Description Converts a file-like object containing a DOCX file to HTML. ### Method `convert_to_html(fileobj, **kwargs)` ### Parameters - **fileobj**: A file-like object to read the DOCX content from. - **kwargs**: Additional keyword arguments for customization. ### Response - Returns a `Result` object containing the HTML output. ``` ```APIDOC ## mammoth.convert_to_markdown ### Description Converts a file-like object containing a DOCX file to Markdown. 
### Method `convert_to_markdown(fileobj, **kwargs)` ### Parameters - **fileobj**: A file-like object to read the DOCX content from. - **kwargs**: Additional keyword arguments for customization. ### Response - Returns a `Result` object containing the Markdown output. ``` ```APIDOC ## mammoth.extract_raw_text ### Description Extracts the raw text content from a file-like object containing a DOCX file. ### Method `extract_raw_text(fileobj)` ### Parameters - **fileobj**: A file-like object to read the DOCX content from. ### Response - Returns a `Result` object containing the extracted raw text. ``` ```APIDOC ## mammoth.embed_style_map ### Description Embeds a style map into a DOCX file-like object. ### Method `embed_style_map(fileobj, style_map)` ### Parameters - **fileobj**: A file-like object representing the DOCX file. - **style_map**: The style map to embed. ### Response - Returns `None`. ``` ```APIDOC ## mammoth.read_embedded_style_map ### Description Reads an embedded style map from a DOCX file-like object. ### Method `read_embedded_style_map(fileobj)` ### Parameters - **fileobj**: A file-like object representing the DOCX file. ### Response - Returns the embedded style map as a string. ``` -------------------------------- ### Create Formatted Text Runs Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md Demonstrates creating `Run` objects with different formatting like bold, custom fonts, highlighting, and vertical alignment (superscript). 
```python
import mammoth.documents as documents

# Create a bold run
bold_run = documents.run([documents.text("Bold text")], is_bold=True)

# Create a bold, highlighted run in a custom font
fancy_run = documents.run(
    [documents.text("Important")],
    font="Arial",
    is_bold=True,
    highlight="yellow"
)

# Superscript text
super_run = documents.run(
    [documents.text("2")],
    vertical_alignment=documents.VerticalAlignment.superscript
)
```

--------------------------------

### Run Class Constructor

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Represents a run of text with consistent formatting properties. Use this to construct text runs with various formatting options.

```python
mammoth.documents.Run(
    children,
    style_id=None,
    style_name=None,
    is_bold=None,
    is_italic=None,
    is_underline=None,
    is_strikethrough=None,
    is_all_caps=None,
    is_small_caps=None,
    vertical_alignment=None,
    font=None,
    font_size=None,
    highlight=None
)
```

--------------------------------

### Normalize Fonts in Runs

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md

Example of a transform function that replaces specific fonts with a standard one. This function is passed to `mammoth.transforms.run`.

```python
import mammoth
import mammoth.transforms

def normalize_font(run):
    # Replace selected serif/monospace fonts with Arial
    if run.font and run.font.lower() in ["times new roman", "courier"]:
        return run.copy(font="Arial")
    return run

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.run(normalize_font)
    )
```

--------------------------------

### Remove Bold Formatting from Runs

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md

Example of a transform function that removes bold formatting from all runs. This function is passed to `mammoth.transforms.run`.
```python def remove_bold(run): return run.copy(is_bold=False) with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html( docx_file, transform_document=mammoth.transforms.run(remove_bold) ) ``` -------------------------------- ### Basic Document Conversion to HTML Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md Demonstrates the fundamental process of converting a DOCX file to HTML using Mammoth. Ensure the 'document.docx' file exists in the same directory. ```python import mammoth with open("document.docx", "rb") as f: result = mammoth.convert_to_html(f) print(result.value) ``` -------------------------------- ### Handle Images with CLI Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md Convert a DOCX file and save images to a specified directory. Existing files in the output directory will be overwritten. ```bash mammoth document.docx --output-dir=output-dir ``` -------------------------------- ### Apply Custom Styles with CLI Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md Convert a DOCX file to HTML using a custom style map defined in a separate file. ```bash mammoth document.docx output.html --style-map=custom-style-map ``` -------------------------------- ### Mammoth Main Entry Point Functions Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md Use these functions for direct document conversion and text extraction. They accept a file-like object and optional keyword arguments for customization. 
```python
import mammoth

# Functions
mammoth.convert_to_html(fileobj, **kwargs)      # → Result
mammoth.convert_to_markdown(fileobj, **kwargs)  # → Result
mammoth.extract_raw_text(fileobj)               # → Result
mammoth.embed_style_map(fileobj, style_map)     # → None
mammoth.read_embedded_style_map(fileobj)        # → str
```

--------------------------------

### Transforming Runs to Remove Bold

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md

Apply a custom function to each run to modify its formatting. This example removes bold formatting from runs.

```python
import mammoth.transforms

# Apply function to each run
def my_run_transform(run):
    if run.is_bold:
        return run.copy(is_bold=False)
    return run

transform = mammoth.transforms.run(my_run_transform)
```

--------------------------------

### Get Descendants of a Specific Type

Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md

Retrieves all descendant elements of a specified type from a given element. This is a utility function for custom document transformations.

```python
import mammoth.documents
import mammoth.transforms

# `paragraph` is an existing document element, e.g. one passed to a transform
runs = mammoth.transforms.get_descendants_of_type(paragraph, mammoth.documents.Run)
```

--------------------------------

### Document Transforms: Paragraphs

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Apply custom transformations to paragraphs during document conversion. This example gives center-aligned paragraphs the 'Heading' style name.

```python
import mammoth
import mammoth.transforms

# Transform paragraphs
def transform_para(para):
    if para.alignment == "center":
        return para.copy(style_name="Heading")
    return para

result = mammoth.convert_to_html(
    f,
    transform_document=mammoth.transforms.paragraph(transform_para)
)
```

--------------------------------

### Mammoth CLI Usage

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md

Basic syntax for using the mammoth command to convert .docx files.
Specify input and optionally output paths.

```bash
mammoth [OPTIONS] docx-path [output-path]
```

--------------------------------

### Mark Highlighted Text in Runs

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md

Example of a transform function that marks highlighted text by making it bold. This function is passed to `mammoth.transforms.run`.

```python
import mammoth
import mammoth.documents
import mammoth.transforms

def mark_highlights(run):
    if run.highlight:
        # Only mark runs that actually contain text
        text_nodes = mammoth.transforms.get_descendants_of_type(
            run, mammoth.documents.Text
        )
        if text_nodes:
            # Marker: make highlighted text bold
            return run.copy(is_bold=True)
    return run

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.run(mark_highlights)
    )
```

--------------------------------

### ParagraphIndent Data Class

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md

Represents paragraph indentation settings, including start, end, first line, and hanging indents. Indent values are in twips.

```python
import cobble

@cobble.data
class ParagraphIndent(object):
    start = cobble.field()       # Start indent (left margin)
    end = cobble.field()         # End indent (right margin)
    first_line = cobble.field()  # First line indent
    hanging = cobble.field()     # Hanging indent
```

--------------------------------

### Write HTML Output as UTF-8

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md

Demonstrates writing the converted HTML content to a file using UTF-8 encoding, either in text or binary mode.
```python with open("output.html", "w", encoding="utf-8") as f: f.write(result.value) ``` ```python with open("output.html", "wb") as f: f.write(result.value.encode("utf-8")) ``` -------------------------------- ### Mammoth CLI: Image Handling Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Convert a DOCX file to HTML and save images to a specified output directory using the Mammoth CLI. ```bash # Convert with images to separate directory mammoth document.docx --output-dir ./output ``` -------------------------------- ### Element is_void Method Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md Checks if an HTML element is a void element, meaning it does not require a closing tag. Examples include br, hr, img, and input. ```python def is_void(self): """Check if element is a void element (no closing tag required).""" ``` -------------------------------- ### Style Map Syntax: Formatting and Highlighting Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Shows how to match common text formatting like bold, italic, underline, strikethrough, and highlights in the Mammoth style map syntax. ```plaintext b / i / u / strike # Bold, italic, underline, strikethrough highlight # Highlight any/specific color ``` -------------------------------- ### Document Transforms: Runs Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Apply custom transformations to runs (text segments) within the document. This example applies a 'code' style to runs with monospace fonts. 
```python
import mammoth
import mammoth.transforms

# Transform runs
def transform_run(run):
    if run.font and "monospace" in run.font.lower():
        return run.copy(style_id="code")
    return run

result = mammoth.convert_to_html(
    f,
    transform_document=mammoth.transforms.run(transform_run)
)
```

--------------------------------

### Chaining Mammoth Operations

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Demonstrates how to chain conversion and transformation operations on Mammoth results. Use `map` for simple transformations and `bind` for operations that return another Result.

```python
import mammoth

result = mammoth.convert_to_html(docx_file)

# Transform result
result2 = result.map(lambda html: html.upper())

# Chain with bind
def add_header(html):
    # The heading markup here is illustrative; use whatever header you need
    new_html = "<h1>Document</h1>" + html
    return mammoth.results.Result(new_html, [])

result3 = result.bind(add_header)
```

--------------------------------

### Get All Descendants of an Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md

Retrieves all children, grandchildren, and deeper descendants of a given document element in depth-first order. Useful for processing all content within a specific part of the document.

```python
import mammoth
import mammoth.transforms
import mammoth.documents

def transform_document(document):
    # Get all text in the document
    all_descendants = mammoth.transforms.get_descendants(document)
    text_nodes = [d for d in all_descendants if isinstance(d, mammoth.documents.Text)]

    total_chars = sum(len(text.value) for text in text_nodes)
    print(f"Total characters: {total_chars}")

    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform_document
    )
```

--------------------------------

### Style Map Syntax: Matching by Attributes and Styles

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Demonstrates how to match DOCX elements based on their style names, IDs, and list levels using Mammoth's style map syntax.

```plaintext
p[style-name='H1']       # By style name
p.StyleId                # By style ID
p[style-name^='Head']    # Prefix match
p:ordered-list(1)        # List level
```

--------------------------------

### Create Plain Text Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Use this to create a plain text element. Ensure the 'mammoth.documents' module is imported.
```python import mammoth.documents as documents text = documents.text("Hello, world!") ``` -------------------------------- ### Convert DOCX and Pipe to Sed Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md Demonstrates piping the output of the mammoth conversion to another command-line tool like `sed` for further processing, such as modifying HTML tags. ```bash mammoth document.docx | sed 's/

/

/' > output.html ``` -------------------------------- ### Mammoth CLI: Basic HTML Conversion Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md Convert a DOCX file to an HTML file using the Mammoth command-line interface. ```bash # Convert to HTML mammoth document.docx output.html ``` -------------------------------- ### Convert Center-Aligned Paragraphs to H2 Headings Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md This example demonstrates how to use the `paragraph` transform to convert center-aligned paragraphs without a specific style ID into H2 headings. It requires the `mammoth` library to be imported. ```python import mammoth import mammoth.transforms # Convert center-aligned paragraphs to h2 headings def transform_paragraph(paragraph): if paragraph.alignment == "center" and not paragraph.style_id: return paragraph.copy(style_id="Heading2", style_name="Heading 2") return paragraph with open("document.docx", "rb") as docx_file: result = mammoth.convert_to_html( docx_file, transform_document=mammoth.transforms.paragraph(transform_paragraph) ) ``` -------------------------------- ### Add Prefix to Specific Paragraph Style Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/transforms.md This example shows how to add a 'NOTE: ' prefix to all paragraphs with the style name 'Note'. The `add_prefix` function modifies the paragraph's children by prepending a new run containing the prefix. 
```python
import mammoth
import mammoth.documents
import mammoth.transforms

# Add a prefix to all paragraphs with a specific style
def add_prefix(paragraph):
    if paragraph.style_name == "Note":
        run = mammoth.documents.run([mammoth.documents.text("NOTE: ")])
        return paragraph.copy(children=[run] + paragraph.children)
    return paragraph

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=mammoth.transforms.paragraph(add_prefix)
    )
```

--------------------------------

### Sanitize HTML Output

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md

Sanitize HTML generated by Mammoth before embedding it in web pages to prevent Cross-Site Scripting (XSS) attacks. This example uses the 'bleach' library to clean the HTML, allowing only specified safe tags.

```python
import bleach
import mammoth

with open("document.docx", "rb") as docx_file:
    html = mammoth.convert_to_html(docx_file).value

# Keep only an explicit allow-list of tags
safe_html = bleach.clean(html, tags=['p', 'a', 'h1', 'h2', 'strong', 'em'])
```

--------------------------------

### Sanitize HTML Output with Bleach

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md

Sanitize generated HTML to remove potentially harmful content such as script tags and `javascript:` URLs before embedding it in web pages. This example uses the 'bleach' library to clean the HTML, allowing only specific tags.

```python
from bleach import clean

html = clean(result.value, tags=['p', 'a', 'h1', 'h2'])
```

--------------------------------

### Create a Paragraph with Multiple Runs

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Illustrates creating a Paragraph with multiple runs, including one styled as bold, to represent more complex text formatting.
```python
import mammoth.documents as documents

# Create a paragraph with multiple runs
para = documents.paragraph([
    documents.run([documents.text("Bold ")], is_bold=True),
    documents.run([documents.text("text")])
])
```

--------------------------------

### Process Multiple DOCX Files in a Loop

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md

Iterates through all .docx files in the current directory and converts each one to its corresponding .html file using the mammoth CLI.

```bash
for file in *.docx; do
  mammoth "$file" "${file%.docx}.html"
done
```

--------------------------------

### Match Bold Text

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Use 'b' to match text formatted as bold.

```plaintext
b
```

--------------------------------

### Mammoth CLI: Output to Standard Output

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Omit the output path to write the converted HTML to standard output, which is useful for piping into other commands.

```bash
# Output to stdout
mammoth document.docx | head -20
```

--------------------------------

### Using Default Image Handler (Data URIs)

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/configuration.md

Explicitly use the default image handler for converting images to data URIs.

```python
# Default: data URIs
result = mammoth.convert_to_html(docx_file)
```

```python
# Explicit use of default
result = mammoth.convert_to_html(
    docx_file,
    convert_image=mammoth.images.data_uri
)
```

--------------------------------

### Fail Conversion on Any Warnings

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/errors.md

Implement a strategy to halt the conversion process if any warnings are generated. This ensures that all potential issues are addressed before proceeding.
```python
result = mammoth.convert_to_html(docx_file)
if result.messages:
    raise Exception("Conversion had issues:\n" + "\n".join(m.message for m in result.messages))
html = result.value
```

--------------------------------

### Mammoth CLI: Custom Styles

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Convert a DOCX file to HTML using a custom style map defined in a separate file via the Mammoth CLI.

```bash
# Convert with custom styles
mammoth document.docx output.html --style-map styles.txt
```

--------------------------------

### Style Mapping: Word Styles to HTML

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/README.md

Demonstrates the style mapping system that converts Word document styles into corresponding HTML elements. This process focuses on semantic structure rather than exact visual replication.

```plaintext
Word Document Style → Style Matcher → HTML Path → HTML Output
"Heading 1" style → p[style-name='Heading 1'] → h1:fresh → <h1>...</h1>
"Normal" style → p[style-name='Normal'] → p:fresh → <p>...</p>
```

--------------------------------

### Process Multiple DOCX Files

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Iterates through a directory, converts all .docx files to HTML, and saves them to another directory. Ensure input and output directories exist.

```python
import os
import mammoth

for filename in os.listdir("docx_folder"):
    if filename.endswith(".docx"):
        with open(f"docx_folder/{filename}", "rb") as f:
            result = mammoth.convert_to_html(f)
        # Write as UTF-8 so non-ASCII text survives on every platform
        with open(f"html_folder/{filename[:-5]}.html", "w", encoding="utf-8") as out:
            out.write(result.value)
```

--------------------------------

### StringMatcher

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md

Matches style names using string matching operations, supporting exact and prefix matches in a case-insensitive manner.

```APIDOC
## StringMatcher

### Description
Matches style names using string matching operations.

### Fields
- **operator** (callable) - Function that performs comparison (case-insensitive)
- **value** (str) - The string pattern to match

### Factory Functions
- **equal_to(value)**: Create a matcher for exact match (case-insensitive)
- **starts_with(value)**: Create a matcher for prefix match (case-insensitive)

### Example
```python
import mammoth.document_matchers as matchers

# Match styles exactly
exact = matchers.equal_to("Heading 1")

# Match styles by prefix
prefix = matchers.starts_with("Heading")
```
```

--------------------------------

### Match Any Paragraph

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Use 'p' to match any paragraph element in the document.

```plaintext
p
```

--------------------------------

### Match Run by Style ID

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Match runs using their style ID for precise formatting selection.
```plaintext
r.Strong
r.Emphasis
```

--------------------------------

### RunMatcher

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md

Represents a pattern for matching run elements based on style ID and style name.

```APIDOC
## RunMatcher

### Description
Matches run elements based on their style ID and style name.

### Fields
- **style_id** (str) - Required - Run style ID to match.
- **style_name** (StringMatcher) - Required - Run style name pattern to match.
```

--------------------------------

### ImageWriter Class Initialization

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md

Python code snippet showing the initialization of the internal ImageWriter class, which is used by the CLI for handling image extraction during conversion.

```python
class ImageWriter(object):
    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1

    def __call__(self, image):
        # Saves image to output_dir and returns src path
        ...  # implementation omitted
```

--------------------------------

### Match Any Table

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Use 'table' to match any table element in the document.

```plaintext
table
```

--------------------------------

### Convert DOCX to HTML with Mammoth

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md

Use this snippet to convert a .docx file to HTML. It includes basic error handling for warnings generated during the conversion process.
```python
import mammoth

# Convert .docx to HTML
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

# Check for warnings
for message in result.messages:
    print(f"Warning: {message.message}")
```

--------------------------------

### Checking Conversion Messages

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md

Shows how to iterate through and print any messages (warnings or errors) generated during the conversion process. These messages provide insights into potential issues.

```python
for msg in result.messages:
    print(f"[{msg.type}] {msg.message}")
```

--------------------------------

### Select a fresh H1 element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md

Use the `:fresh` modifier so that a new `h1` element is always created instead of reusing one that is already open.

```mammoth
h1:fresh
```

--------------------------------

### Match All Caps Text

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Use 'all-caps' to match text formatted in all capital letters.

```plaintext
all-caps
```

--------------------------------

### Create Checkbox Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Instantiate a Checkbox element. The `checked` parameter determines its initial state.

```python
mammoth.documents.Checkbox(checked)
```

--------------------------------

### Convert DOCX to Markdown via CLI (Deprecated)

Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md

Demonstrates the deprecated method of generating Markdown output directly from a DOCX file using the CLI.

```bash
mammoth document.docx --output-format=markdown
```

--------------------------------

### Custom Style Map for DOCX to HTML

Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md

Applies custom style mappings to convert specific .docx styles to HTML elements.
User-defined mappings take precedence over defaults. The 'fresh' keyword ensures a new element is created.

```python
import mammoth

style_map = """
p[style-name='Section Title'] => h1:fresh
p[style-name='Subsection Title'] => h2:fresh
"""

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
```

--------------------------------

### Run Class

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Represents a run of text with consistent formatting properties. It can contain child elements like Text and Tab, and has properties for styling such as bold, italic, underline, font, and highlight color.

```APIDOC
## Run Class

### Description
Represents a run of text with consistent formatting properties.

### Class Definition
```python
@cobble.data
class Run(HasChildren):
    children = cobble.field()            # list of child elements (Text, Tab, etc.)
    style_id = cobble.field()
    style_name = cobble.field()
    is_bold = cobble.field()             # bool
    is_italic = cobble.field()           # bool
    is_underline = cobble.field()        # bool
    is_strikethrough = cobble.field()    # bool
    is_all_caps = cobble.field()         # bool
    is_small_caps = cobble.field()       # bool
    vertical_alignment = cobble.field()  # "baseline", "superscript", "subscript"
    font = cobble.field()                # font name string
    font_size = cobble.field()           # font size in half-points
    highlight = cobble.field()           # highlight color string
```

### Properties
| Property | Type | Description |
|----------|------|-------------|
| children | list | Text nodes, tabs, or other inline content |
| style_id | str | The run style ID |
| style_name | str | The run style name |
| is_bold | bool | Whether text is bold |
| is_italic | bool | Whether text is italic |
| is_underline | bool | Whether text is underlined |
| is_strikethrough | bool | Whether text has strikethrough |
| is_all_caps | bool | Whether text is uppercase |
| is_small_caps | bool | Whether text uses small capitals |
| vertical_alignment | str | "baseline", "superscript", or "subscript" |
| font | str | Font family name |
| font_size | int | Size in half-points (e.g., 24 = 12pt) |
| highlight | str | Highlight color (e.g., "yellow", "blue") |

### Example
```python
import mammoth.documents as documents

# Create a bold run
bold_run = documents.run([documents.text("Bold text")], is_bold=True)

# Create a bold, highlighted run in Arial
fancy_run = documents.run(
    [documents.text("Important")],
    font="Arial",
    is_bold=True,
    highlight="yellow"
)

# Superscript text
super_run = documents.run(
    [documents.text("2")],
    vertical_alignment=documents.VerticalAlignment.superscript
)
```
```

--------------------------------

### Basic DOCX to HTML Conversion

Source: https://github.com/mwilliamson/python-mammoth/blob/master/README.md

Converts a .docx file to HTML using a file-like object. Ensure the file is opened in binary mode. The result object contains the generated HTML and any conversion messages.

```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion
```

--------------------------------

### Mammoth Images Module

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/INDEX.md

Utilities for handling image conversion within Mammoth.

```APIDOC
## mammoth.images.img_element

### Description
Creates an image converter function.

### Method
`img_element(func)`

### Parameters
- **func**: A function to use for converting image elements.
```

```APIDOC
## mammoth.images.data_uri

### Description
Default image converter that generates data URIs.

### Usage
This is a default converter function.
```

--------------------------------

### Create a Table Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Constructs a Table element with specified children and optional style information.
Children should be TableRow objects.

```python
mammoth.documents.Table(children, style_id=None, style_name=None)
```

--------------------------------

### Match Run by Style Name

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Match runs based on their specific style name, such as 'Strong' or 'Emphasis'.

```plaintext
r[style-name='Strong']
r[style-name='Emphasis']
```

--------------------------------

### Convert DOCX to HTML with Custom Options

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Use this snippet to convert a DOCX file object to HTML with various customization options. It shows how to specify custom styles, image handling, ID prefixes, and other conversion parameters.

```python
result = mammoth.convert_to_html(
    fileobj,
    style_map="p => p:fresh",              # Custom styles
    convert_image=mammoth.images.data_uri, # Image handler
    id_prefix="doc_",                      # ID prefix
    ignore_empty_paragraphs=True,          # Skip empty paras
    include_embedded_style_map=True,       # Use embedded map
    include_default_style_map=True,        # Use defaults
    external_file_access=False,            # Secure
    transform_document=None,               # Pre-conversion transform
)
```

--------------------------------

### Convert DOCX with Custom Style Map

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/cli.md

Converts a .docx file to HTML using a custom style map file to control the output's HTML structure and styling.

```bash
mammoth document.docx output.html --style-map styles.txt
```

--------------------------------

### Basic Style Mapping

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/style-maps.md

Map standard Word styles like Title, Subtitle, and Body to HTML elements.
```style-map
p[style-name='Title'] => h1:fresh
p[style-name='Subtitle'] => h2:fresh
p[style-name='Body'] => p:fresh
```

--------------------------------

### Handling Images During Conversion

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/00-START-HERE.md

Illustrates how to process images within a DOCX document during conversion. The `save_image` function should be implemented to handle image saving logic.

```python
def save_image(image):
    with image.open() as img:
        # Save to file
        return {"src": "path/to/image.png"}

result = mammoth.convert_to_html(
    f,
    convert_image=mammoth.images.img_element(save_image)
)
```

--------------------------------

### Mammoth Useful Constants

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/quick-reference.md

Lists useful constants provided by the Mammoth library for document formatting and image conversion.

```python
# Vertical alignment
mammoth.documents.VerticalAlignment.baseline
mammoth.documents.VerticalAlignment.superscript
mammoth.documents.VerticalAlignment.subscript

# Break types
mammoth.documents.line_break
mammoth.documents.page_break
mammoth.documents.column_break

# Image converters
mammoth.images.data_uri  # Default: embed as data URIs
```

--------------------------------

### Create a Note Element

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/api-reference/documents.md

Represents a footnote or endnote. Requires note type, a unique ID, and the body content.

```python
mammoth.documents.Note(note_type, note_id, body)
```

--------------------------------

### StringMatcher Factory Functions

Source: https://github.com/mwilliamson/python-mammoth/blob/master/_autodocs/types.md

Create matchers for style names using exact or prefix string comparisons. These are used in style mappings for document parsing.
```python
import mammoth.document_matchers as matchers

# Match styles exactly
exact = matchers.equal_to("Heading 1")

# Match styles by prefix
prefix = matchers.starts_with("Heading")

# Used in style mappings (via parser)
# p[style-name='Heading 1'] => h1:fresh
# p[style-name^='Heading'] => h2:fresh
```
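To make the case-insensitive behaviour of the two factory functions concrete, here is a minimal standalone sketch. It does not import Mammoth and is not the library's implementation: `StringMatcherSketch`, `equal_to_sketch`, `starts_with_sketch`, and the `matches` method are hypothetical names that only mirror the documented `operator` and `value` fields.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StringMatcherSketch:
    """Hypothetical stand-in with the documented fields; not Mammoth's class."""
    operator: Callable[[str, str], bool]  # comparison function
    value: str                            # pattern to compare against

    def matches(self, style_name: str) -> bool:
        # Assumed helper for illustration: apply the operator to the pattern
        return self.operator(self.value, style_name)


def _equal_ignoring_case(expected: str, actual: str) -> bool:
    return expected.casefold() == actual.casefold()


def _starts_with_ignoring_case(prefix: str, actual: str) -> bool:
    return actual.casefold().startswith(prefix.casefold())


def equal_to_sketch(value: str) -> StringMatcherSketch:
    # Exact match, ignoring case
    return StringMatcherSketch(_equal_ignoring_case, value)


def starts_with_sketch(value: str) -> StringMatcherSketch:
    # Prefix match, ignoring case
    return StringMatcherSketch(_starts_with_ignoring_case, value)


exact = equal_to_sketch("Heading 1")
prefix = starts_with_sketch("Heading")

print(exact.matches("heading 1"))   # True: case is ignored
print(prefix.matches("HEADING 2"))  # True: prefix match, case ignored
print(exact.matches("Heading 2"))   # False: not an exact match
```

Under these assumptions, the exact matcher corresponds to `[style-name='Heading 1']` in a style map and the prefix matcher to `[style-name^='Heading']`.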