### Inline Code Example
Source: https://github.com/dfop02/html4docx/blob/main/tests/assets/htmls/code.html
This snippet shows a single line of code enclosed in backticks, intended for inline display within text.
```code
This is a code block. That should be NOT be pre-formatted. It should NOT retain carriage returns, or all white space. or blank lines. Tabs tabs tabs tabs spac spac spac spac
```
--------------------------------
### Run html4docx with BeautifulSoup HTML fixing
Source: https://context7.com/dfop02/html4docx/llms.txt
This command runs the html4docx converter with BeautifulSoup enabled for HTML fixing. Ensure you have html4docx installed via pip.
```bash
python -m html4docx.h4d input.html output_report --bs
```
--------------------------------
### Initialize HtmlToDocx Parser
Source: https://context7.com/dfop02/html4docx/llms.txt
Create an instance of the parser. Sensible defaults are applied if parameters are omitted. Custom style maps and tag overrides can be provided for advanced styling.
```python
from html4docx import HtmlToDocx
# Minimal: use all defaults
parser = HtmlToDocx()
# With CSS-class-to-Word-style mapping and tag overrides
style_map = {
    'code-block': 'Code Block',
    'finding-critical': 'Finding Critical',
}
tag_overrides = {
    'h1': 'Custom Heading 1',
    'pre': 'Code Block',
}
parser = HtmlToDocx(
    style_map=style_map,
    tag_style_overrides=tag_overrides,
    default_paragraph_style='Body Text',
)
```
--------------------------------
### Clone Parser Settings
Source: https://context7.com/dfop02/html4docx/llms.txt
Demonstrates using `HtmlToDocx.copy_settings_from` to propagate parser configurations like table style, `style_map`, `tag_style_overrides`, and `default_paragraph_style` from one parser instance to another. This is useful for reusing configurations across multiple documents.
```python
from docx import Document
from html4docx import HtmlToDocx
# Master parser with full configuration
master_parser = HtmlToDocx(
    style_map={'highlight': 'Intense Quote'},
    tag_style_overrides={'h1': 'Title'},
    default_paragraph_style='Body Text',
)
master_parser.table_style = 'Light Grid'
# Child parser shares settings
child_parser = HtmlToDocx()
child_parser.copy_settings_from(master_parser)
doc = Document()
child_parser.add_html_to_document('
See diagram: for details.
Visit our website at example.com for more information.
Jump to the introduction above.
"""
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('links.docx')
# External links: blue (#0000EE) underlined text with optional tooltip
# Internal anchors: href="#bookmark-id" links to elements with matching id=""
```
--------------------------------
### Configure Conversion Options
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Customize the HTML to DOCX conversion process by enabling or disabling various features like images, tables, styles, and HTML fixing. The options are set as boolean values in the parser's 'options' dictionary.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.options['images'] = False # Default True
parser.options['tables'] = False # Default True
parser.options['styles'] = False # Default True
parser.options['fix-html'] = False # Default True
parser.options['html-comments'] = False # Default False
parser.options['style-map'] = False # Default True
parser.options['tag-override'] = False # Default True
docx = parser.parse_html_string(input_html_file_string)
```
--------------------------------
### Apply Semantic Inline Formatting in DOCX
Source: https://context7.com/dfop02/html4docx/llms.txt
Illustrates mapping of HTML semantic inline tags like `<b>`, `<i>`, `<u>`, `<s>`, and `<mark>` to Word character formatting. `<mark>` applies a yellow background highlight, and `<code>`/`<pre>` use Courier font.
```python
from html4docx import HtmlToDocx
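html = """
<p>
  <b>Bold</b>, <strong>also bold</strong>,
  <i>italic</i>, <em>also italic</em>,
  <u>underlined</u>, <ins>inserted (underlined)</ins>,
  <s>strikethrough</s>, <del>deleted (strikethrough)</del>,
  <mark>highlighted in yellow</mark>,
  H<sub>2</sub>O and E=mc<sup>2</sup>.
</p>
<p>Inline <code>code snippet</code> uses Courier font.</p>
<pre>def block_code():
    return "pre block also uses Courier"</pre>
"""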
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('inline_tags.docx')
```
--------------------------------
### Inline Code Element
Source: https://github.com/dfop02/html4docx/blob/main/tests/assets/htmls/code.html
Demonstrates the use of inline code formatting for short code fragments within a sentence.
```text
code
```
--------------------------------
### Set Default Paragraph Style
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Configure the default paragraph style for the document. Use 'Body' for the default behavior or None to use Word's 'Normal' style.
```python
from html4docx import HtmlToDocx
# Use 'Body' as default (default behavior)
parser = HtmlToDocx(default_paragraph_style='Body')
# Use Word's default 'Normal' style
parser = HtmlToDocx(default_paragraph_style=None)
```
--------------------------------
### Use Custom Styles from a Word Template
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Apply custom styles defined in a Word template (.docx) by passing the template document to the HtmlToDocx parser. Save the output document to preserve these styles.
```python
from docx import Document
from html4docx import HtmlToDocx
doc = Document("path/to/template.docx") # template has Code Block, Custom Markdown, etc.
parser = HtmlToDocx(tag_style_overrides={"code": "Custom Markdown", "pre": "Code Block"})
parser.add_html_to_document(html, doc)
doc.save("output.docx") # save the template-based doc so custom styles are preserved
```
--------------------------------
### Map CSS Classes to Word Styles
Source: https://context7.com/dfop02/html4docx/llms.txt
Use the `style_map` parameter to map HTML classes to specific Word paragraph styles. This requires a document template that defines these styles.
```python
from docx import Document
from html4docx import HtmlToDocx
style_map = {
'note': 'Quote',
'warning': 'Intense Quote',
'code-block': 'No Spacing',
}
doc = Document('path/to/branded_template.docx')
parser = HtmlToDocx(style_map=style_map)
html = """
<p class="note">This is a note paragraph.</p>
<p class="warning">Warning: data loss may occur.</p>
<p class="code-block">def hello(): pass</p>
"""
parser.add_html_to_document(html, doc)
doc.save('styled_output.docx')
```
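Because every mapped style has to exist in the template, it can help to check the mapping before converting. The helper below is a minimal sketch and not part of html4docx: `missing_styles` is a hypothetical name, and it takes the available style names as a plain iterable so that it runs without a template. With python-docx, those names could be collected with `[s.name for s in doc.styles]`.

```python
def missing_styles(style_map, available_names):
    """Return the mapped Word style names that the template does not define.

    Hypothetical helper (not part of html4docx). `style_map` maps CSS class
    names to Word style names; `available_names` is any iterable of the
    style names defined in the template.
    """
    available = set(available_names)
    missing = []
    # Keep the order of the mapping and report each missing style once.
    for word_style in style_map.values():
        if word_style not in available and word_style not in missing:
            missing.append(word_style)
    return missing


style_map = {
    'note': 'Quote',
    'warning': 'Intense Quote',
    'code-block': 'No Spacing',
}
# Stand-in for the style names of a real template.
template_styles = ['Normal', 'Quote', 'No Spacing']
print(missing_styles(style_map, template_styles))
```

An empty result means every style named in the mapping was found among the supplied names.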
--------------------------------
### Save Document to Path or BytesIO
Source: https://context7.com/dfop02/html4docx/llms.txt
Saves the underlying document to a file path or an in-memory BytesIO buffer. The '.docx' extension is automatically appended if not present.
```python
from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx
```
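The extension rule described above can be illustrated in a few lines. This is only a sketch of the documented behavior, not the library's code: `ensure_docx_suffix` is a hypothetical name, and the case-insensitive comparison is an assumption.

```python
import os


def ensure_docx_suffix(path):
    """Return `path` as a string, appending '.docx' unless it already ends with it.

    Hypothetical helper; treating '.DOCX' as already present is an assumption.
    """
    text = os.fspath(path)
    if text.lower().endswith('.docx'):
        return text
    return text + '.docx'


print(ensure_docx_suffix('output/hello'))
print(ensure_docx_suffix('report.docx'))
```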
--------------------------------
### Save Document to File or Buffer
Source: https://context7.com/dfop02/html4docx/llms.txt
Demonstrates saving a generated Word document to a file path or an in-memory BytesIO buffer. The buffer can be used for web responses.
```python
from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx
# Save to a file path
document = Document()
parser = HtmlToDocx()
parser.add_html_to_document('<h1>Hello</h1>', document)
parser.save('output/hello') # Saves as output/hello.docx
# Save to an in-memory buffer
buffer = BytesIO()
document2 = Document()
parser2 = HtmlToDocx()
parser2.add_html_to_document('<p>Report content</p>', document2)
parser2.save(buffer)
buffer.seek(0) # Rewind so the buffer can be read from the start
```
--------------------------------
### Configure Parser Options
Source: https://context7.com/dfop02/html4docx/llms.txt
Control parser behavior by modifying the `options` dictionary. Disable image embedding, table rendering, and style application for plain text output, or enable HTML comment rendering.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
# Default values:
# parser.options['images'] = True — embed images
# parser.options['tables'] = True — render HTML tables
# parser.options['styles'] = True — apply CSS styles
# parser.options['fix-html'] = True — run BeautifulSoup HTML cleanup
# parser.options['html-comments'] = False — render as visible green text
# parser.options['style-map'] = True — apply CSS-class → Word style mapping
# parser.options['tag-override'] = True — apply tag → Word style overrides
# Strip all styling and images for a plain-text-only docx
parser.options['images'] = False
parser.options['styles'] = False
parser.options['tables'] = False
# Render HTML comments as visible italic green text
parser.options['html-comments'] = True
doc = parser.parse_html_string('<p>Content</p>')
doc.save('plain.docx')
```
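Option names contain hyphens, so a typo such as `fix_html` would silently add a new key instead of changing a setting. The sketch below shows one way to guard against that. `apply_options` and `KNOWN_OPTIONS` are hypothetical names, not part of html4docx, and a plain dict stands in for `parser.options`; the names and defaults are copied from the comments in the example above.

```python
# Option names and defaults as listed in the example above.
KNOWN_OPTIONS = {
    'images': True,
    'tables': True,
    'styles': True,
    'fix-html': True,
    'html-comments': False,
    'style-map': True,
    'tag-override': True,
}


def apply_options(options, overrides):
    """Update an options dict in place, rejecting unknown or non-boolean entries.

    Hypothetical helper (not part of html4docx).
    """
    for name, value in overrides.items():
        if name not in KNOWN_OPTIONS:
            raise KeyError(f'unknown option: {name!r}')
        if not isinstance(value, bool):
            raise TypeError(f'option {name!r} expects a bool')
        options[name] = value
    return options


# A plain dict stands in for parser.options here.
options = dict(KNOWN_OPTIONS)
apply_options(options, {'images': False, 'styles': False, 'tables': False})
print(options['images'], options['fix-html'])
```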
--------------------------------
### Read and Set Document Metadata
Source: https://context7.com/dfop02/html4docx/llms.txt
Explains how to access and modify the document's built-in metadata (author, title, subject, etc.) using the `parser.metadata` property. Invalid revision or datetime strings will print a warning and be skipped.
```python
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
parser.set_initial_attrs(document)
metadata = parser.metadata
# Read all metadata as a dict
props = metadata.get_metadata()
print(props.get('author')) # e.g., '' (empty on new document)
print(props.get('created')) # datetime object
# Print all metadata to stdout as formatted JSON
metadata.get_metadata(print_result=True)
# Set metadata fields
metadata.set_metadata(
author='Jane Smith',
title='Q4 Financial Report',
subject='Finance',
keywords='finance, quarterly, 2024',
description='Official Q4 2024 report',
revision='3',
created='2024-01-01T00:00:00',
modified='2024-12-31T23:59:59',
)
parser.add_html_to_document('<h1>Q4 Report</h1><p>Content here.</p>', document)
document.save('q4_report.docx')
# Invalid revision (non-integer) or datetime string prints a warning and skips that field.
```
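Since invalid values are skipped with only a warning, checking them up front can make failures easier to notice. The following is a minimal sketch using only the standard library. `check_metadata_values` is a hypothetical name, and the rules it applies (integer revision, ISO 8601 datetime strings as in the example above) are assumptions; the formats the library actually accepts may be broader.

```python
from datetime import datetime


def check_metadata_values(revision=None, created=None, modified=None):
    """Return a list of problems found in the given metadata values.

    Hypothetical pre-check (not part of html4docx). Assumes a revision must
    parse as an integer and datetime strings must be ISO 8601.
    """
    problems = []
    if revision is not None:
        try:
            int(revision)
        except (TypeError, ValueError):
            problems.append(f'revision is not an integer: {revision!r}')
    for field, value in (('created', created), ('modified', modified)):
        if value is None:
            continue
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f'{field} is not an ISO 8601 datetime: {value!r}')
    return problems


print(check_metadata_values(revision='3', created='2024-01-01T00:00:00'))
print(check_metadata_values(revision='three', modified='31/12/2024'))
```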
--------------------------------
### Apply Table Styles
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Set a specific table style for all tables converted from HTML. The 'table_style' attribute must be set on the parser instance before conversion. Supported styles can be found in the python-docx documentation.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.table_style = 'Light Shading Accent 4'
docx = parser.parse_html_string(input_html_file_string)
```
```python
parser.table_style = 'Table Grid'
```
--------------------------------
### Add HTML to Existing DOCX Document
Source: https://github.com/dfop02/html4docx/blob/main/README.md
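Use this method to add HTML-formatted content to an existing .docx document. Pass a python-docx Document object, for example one opened from an existing file, rather than a file name.
```python
from docx import Document
from html4docx import HtmlToDocx
# filename_docx is the path of an existing .docx file
document = Document(filename_docx)
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)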
```
--------------------------------
### Apply Word Table Style
Source: https://context7.com/dfop02/html4docx/llms.txt
Set the `table_style` attribute to apply a specific Word table style to all tables generated from HTML. Ensure the style exists in your document template.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.table_style = 'Table Grid' # Bordered grid style
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>95</td></tr>
  <tr><td>Bob</td><td>87</td></tr>
</table>
"""
doc = parser.parse_html_string(html)
doc.save('scores.docx')
```
--------------------------------
### HtmlToDocx.parse_html_file
Source: https://context7.com/dfop02/html4docx/llms.txt
Reads an HTML file from disk, converts its content, and saves the result as a .docx file. An optional encoding parameter can be specified for handling different file encodings.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
# Basic usage: explicit input and output paths
parser.parse_html_file('report.html', 'output/report.docx')
# With explicit encoding
parser.parse_html_file('legacy_report.html', 'output/legacy_report.docx', encoding='latin-1')
# Output filename is optional; the default name starts with new_docx_file_ and is placed in the same directory as the input
parser.parse_html_file('report.html', None)
# Saves as: new_docx_file_report.html (alongside report.html)
```
--------------------------------
### Apply Inline CSS for Text and Paragraph Properties
Source: https://context7.com/dfop02/html4docx/llms.txt
Use inline style attributes on HTML elements to control typography, color, spacing, and decoration for runs or paragraph formats in the DOCX. Supported properties include text-align, line-height, margin-left, font-family, font-size, color, font-weight, text-indent, text-decoration, background-color, font-style, and text-transform.
```python
from html4docx import HtmlToDocx
html = """
Centered, spaced heading-like paragraph
First line indented with
wavy red underline
and yellow highlight.
Blue italic paragraph using RGB color.
this text will be uppercased in courier via serif generic mapping.
"""
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('styled_text.docx')
```
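To see which declarations in a style attribute fall inside the list above, the attribute can be split and compared against the property names. This is a simplified sketch, not html4docx's parser: `split_declarations` is a hypothetical name, and the property set is copied from the description, which says "include" and may therefore be incomplete.

```python
# Property names taken from the description above (possibly incomplete).
SUPPORTED_PROPERTIES = {
    'text-align', 'line-height', 'margin-left', 'font-family', 'font-size',
    'color', 'font-weight', 'text-indent', 'text-decoration',
    'background-color', 'font-style', 'text-transform',
}


def split_declarations(style):
    """Split an inline style string into (supported, unsupported) dicts.

    Simplified: splits on ';' and the first ':' only, so semicolons inside
    quoted values or url(...) are not handled.
    """
    supported, unsupported = {}, {}
    for declaration in style.split(';'):
        name, sep, value = declaration.partition(':')
        name, value = name.strip().lower(), value.strip()
        if not sep or not name or not value:
            continue  # skip empty or malformed declarations
        target = supported if name in SUPPORTED_PROPERTIES else unsupported
        target[name] = value
    return supported, unsupported


ok, ignored = split_declarations('color: rgb(0, 0, 255); font-style: italic; float: left')
print(ok)
print(ignored)
```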
--------------------------------
### Convert HTML File Directly to DOCX
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Convert an HTML file to a DOCX file using the parse_html_file method. You can specify the input HTML file path, output DOCX file path, and optionally the file encoding (defaults to 'utf-8').
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')
```
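The encoding argument matters when the HTML file was not written as UTF-8. The sketch below uses only the standard library to show the failure that a mismatched codec produces and how naming the right codec avoids it; the file name and sample text are made up for the illustration. With html4docx, the same codec name would be passed as the encoding argument of `parse_html_file`.

```python
import os
import tempfile

# Write a small HTML file in Latin-1; each accented letter becomes one byte (0xE9).
html = '<p>Résumé</p>'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'legacy.html')
    with open(path, 'w', encoding='latin-1') as handle:
        handle.write(html)

    # Reading with the wrong codec fails: a lone 0xE9 byte is not valid UTF-8.
    try:
        with open(path, encoding='utf-8') as handle:
            handle.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Reading with the codec the file was written in round-trips the text.
    with open(path, encoding='latin-1') as handle:
        latin1_text = handle.read()

print(utf8_ok, latin1_text == html)
```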
--------------------------------
### Incrementally Add HTML to Document and Save
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Add multiple HTML snippets to a document incrementally. The content is appended to the end of the document. Saving can be done using either python-docx's document.save() or html4docx's parser.save().
```python
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
for part in ['First', 'Second', 'Third']:
    parser.add_html_to_document(f'<h2>{part} Part</h2>', document)
parser.save('your_file_name.docx')
```
--------------------------------
### Map CSS Classes to Word Styles
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Define a mapping between HTML CSS classes and Word document styles to control the appearance of specific HTML elements. Pass the style map as an argument during HtmlToDocx instantiation or use add_html_to_document.
```python
from html4docx import HtmlToDocx
style_map = {
'code-block': 'Code Block',
'numbered-heading-1': 'Heading 1 Numbered',
'finding-critical': 'Finding Critical'
}
parser = HtmlToDocx(style_map=style_map)
parser.add_html_to_document(html, document)
```
--------------------------------
### Convert HTML String to a New Document
Source: https://context7.com/dfop02/html4docx/llms.txt
Parses an HTML string and returns a new python-docx Document object. This is suitable for one-shot conversions.
```python
from html4docx import HtmlToDocx
html = """
Product Specification
Model: X-500
Note: Subject to change without notice.
- Step one: Unbox the unit
- Step two: Connect to power
- Use the provided cable
- Verify LED indicator
"""
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('spec_sheet.docx')
# Returns: docx.document.Document
```
--------------------------------
### Manage Document Metadata
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Read and set document metadata such as author and creation date using the parser's metadata attributes. Available attributes can be found in the python-docx documentation.
```python
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
parser.set_initial_attrs(document)
metadata = parser.metadata
# You can get metadata as dict
metadata_json = metadata.get_metadata()
print(metadata_json['author']) # Jane
# or just print all metadata if you want
metadata.get_metadata(print_result=True)
# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')
```
--------------------------------
### Pre-formatted Text Block
Source: https://github.com/dfop02/html4docx/blob/main/tests/assets/htmls/code.html
This snippet represents a pre-formatted text block, retaining all whitespace and line breaks exactly as they appear in the source.
```text
This is a pre-formatted block.
That should be pre-formatted.
Retaining any carriage returns, and all white space.
And blank lines.
Tabs tabs tabs tabs
spac spac spac spac
```
--------------------------------
### Save Document to In-Memory Buffer
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Utilize BytesIO to save the DOCX document in memory. This is useful for applications that need to handle the document data without writing to a physical file immediately. Remember to reset the buffer's position after saving if you intend to read from it.
```python
from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx
buffer = BytesIO()
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
# Save the document to the in-memory buffer
parser.save(buffer)
# If you need to read from the buffer again after saving,
# you might need to reset its position to the beginning
buffer.seek(0)
```
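The reason for resetting the position can be shown with `BytesIO` alone. In this sketch a few placeholder bytes stand in for the saved document (they are not a valid DOCX file); only standard-library calls are used.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b'PK\x03\x04 example payload')  # stand-in bytes, not a real DOCX

# After writing, the position is at the end, so read() returns nothing.
at_end = buffer.read()

# Rewinding makes the full content readable again.
buffer.seek(0)
from_start = buffer.read()

# getvalue() returns everything regardless of the current position.
everything = buffer.getvalue()

print(len(at_end), len(from_start), len(everything))
```

Code that hands the buffer to something that calls `read()`, such as a web framework response, needs the `seek(0)`; code that only calls `getvalue()` does not.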
--------------------------------
### Style HTML Table Cells with CSS
Source: https://context7.com/dfop02/html4docx/llms.txt
Apply CSS properties to HTML table cells (`<td>`, `<th>`) for borders, background color, dimensions, and text alignment. Supported properties include border shorthand/longhand, background-color, width, height, color, and vertical-align. The 'Table Grid' style can be applied to the document for consistent table formatting.
```python
from html4docx import HtmlToDocx
html = """
Header A
Header B
Top-aligned dashed cell
Left accent border cell
Merged cell spanning 2 columns
"""
parser = HtmlToDocx()
parser.table_style = 'Table Grid'
doc = parser.parse_html_string(html)
doc.save('styled_table.docx')
# Supported border keywords: thin (1px), medium (3px), thick (5px)
# Supported border styles: solid, dashed, dotted, double, inset, outset
```
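The keyword widths and style names listed in the comments above can be read as a small lookup table. The sketch below parses a simple `width style color` shorthand with that table. It is an illustration only: `parse_border` is a hypothetical name, html4docx's own parsing may differ, and colors containing spaces such as `rgb(0, 0, 0)` are not handled.

```python
# Keyword widths and style names as listed in the comments above.
BORDER_KEYWORDS_PX = {'thin': 1, 'medium': 3, 'thick': 5}
BORDER_STYLES = {'solid', 'dashed', 'dotted', 'double', 'inset', 'outset'}


def parse_border(shorthand):
    """Parse a simple 'width style color' border shorthand into a dict.

    Simplified sketch: tokens are separated by whitespace, widths are
    keywords or 'Npx' values, and any other token is treated as the color.
    """
    result = {'width_px': None, 'style': None, 'color': None}
    for token in shorthand.split():
        lowered = token.lower()
        if lowered in BORDER_KEYWORDS_PX:
            result['width_px'] = BORDER_KEYWORDS_PX[lowered]
        elif lowered.endswith('px') and lowered[:-2].replace('.', '', 1).isdigit():
            result['width_px'] = float(lowered[:-2])
        elif lowered in BORDER_STYLES:
            result['style'] = lowered
        else:
            result['color'] = token
    return result


print(parse_border('thick dashed #336699'))
print(parse_border('2px solid red'))
```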
--------------------------------
### Override Default Tag Styles
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Customize the styles applied to specific HTML tags like 'h1' and 'pre'. Ensure the target styles exist in your Word document.
```python
from html4docx import HtmlToDocx
tag_overrides = {
'h1': 'Custom Heading 1',
'pre': 'Code Block'
}
parser = HtmlToDocx(tag_style_overrides=tag_overrides)
```
--------------------------------
### Apply Inline CSS Styles
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Utilize inline CSS styles directly within HTML tags for precise formatting of text and paragraphs. Supported properties include color, font-size, font-weight, and more.
```html
<p style="color: red; font-size: 14pt">Red 14pt paragraph</p>
<span style="font-weight: bold; color: blue">Bold blue text</span>
```
--------------------------------
### Convert HTML String Directly to DOCX
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Convert an HTML string into a DOCX document object using parse_html_string. The method returns the DOCX object, which can then be saved.
```python
from html4docx import HtmlToDocx
parser = HtmlToDocx()
docx = parser.parse_html_string(input_html_file_string)
```
--------------------------------
### Insert Page Breaks and Horizontal Rules in DOCX
Source: https://context7.com/dfop02/html4docx/llms.txt
Shows how to insert page breaks using CSS page-break properties and horizontal rules using the `<hr>` tag when converting HTML to DOCX. The `<hr>` tag renders as a paragraph-bottom border line.
```python
from html4docx import HtmlToDocx
html = """
Chapter 1
Content for chapter one.
Chapter 2
Content for chapter two starts on a new page.
Section below the horizontal rule.
Chapter 3
"""
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('paged_document.docx')
```
--------------------------------
### Override HTML Tag to Word Style Mapping
Source: https://context7.com/dfop02/html4docx/llms.txt
Use the `tag_style_overrides` parameter to replace default tag-to-style mappings with custom Word styles. This is useful for structural tags like headings and preformatted text.
```python
from docx import Document
from html4docx import HtmlToDocx
tag_overrides = {
'h1': 'Report Title',
'h2': 'Section Header',
'pre': 'Code Block',
}
doc = Document('template_with_custom_styles.docx')
parser = HtmlToDocx(tag_style_overrides=tag_overrides)
html = """
<h1>Executive Summary</h1>
<h2>Background</h2>
<pre>SELECT * FROM reports WHERE year = 2024;</pre>
"""
parser.add_html_to_document(html, doc)
doc.save('executive_summary.docx')
```
--------------------------------
### Add HTML to python-docx Document Object
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Integrate HTML content directly into a python-docx Document object. This allows for further manipulation of the document before saving.
```python
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save('your_file_name.docx')
```
--------------------------------
### Add HTML to an Existing Document
Source: https://context7.com/dfop02/html4docx/llms.txt
Append parsed HTML content to a python-docx Document object. This method can be called multiple times to build a document incrementally. Ensure the input is a string and the document is a valid Document or _Cell object.
```python
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
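html_parts = [
    '<h1>Annual Report 2024</h1>',
    '<p>This report covers all fiscal quarters.</p>',
    '<ul><li>Q1: $1.2M revenue</li><li>Q2: $1.5M revenue</li></ul>',
    '<table><tr><th>Quarter</th><th>Revenue</th></tr>'
    '<tr><td>Q1</td><td>$1.2M</td></tr>'
    '<tr><td>Q2</td><td>$1.5M</td></tr></table>',
]
for part in html_parts:
    parser.add_html_to_document(part, document)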
document.save('annual_report.docx')
# Raises ValueError if html is not str or document is not a Document/_Cell
```
--------------------------------
### Utilize !important Flag in Inline CSS
Source: https://github.com/dfop02/html4docx/blob/main/README.md
Ensure the highest CSS precedence for inline styles by using the '!important' flag. This overrides other style declarations.
```html
<p style="color: gray">Gray text with <span style="color: red !important">red important</span>.</p>
```
--------------------------------
### Handle !important CSS Flag for Style Overrides
Source: https://context7.com/dfop02/html4docx/llms.txt
Styles marked with !important on a child element will override any parent-level styles for the same property, mimicking CSS cascade behavior. This is useful for ensuring specific styles take precedence.
```python
from html4docx import HtmlToDocx
html = """
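<p style="color: gray; font-size: 11pt">
  Normal gray text,
  <span style="color: red !important; font-size: 14pt !important">important red override</span>
  back to gray.
</p>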
"""
parser = HtmlToDocx()
doc = parser.parse_html_string(html)
doc.save('important_styles.docx')
# The span overrides the paragraph's gray color and 11pt size with red and 14pt.
```
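The precedence rule can be modelled in a few lines. The sketch below is a simplified model and an assumption about the behavior described above, not html4docx's actual algorithm: declarations marked `!important` outrank normal ones, and at equal importance the child's value wins for the child's own text. `parse_style` and `resolve` are hypothetical names.

```python
def parse_style(style):
    """Parse an inline style string into {property: (value, is_important)}."""
    parsed = {}
    for declaration in style.split(';'):
        name, sep, value = declaration.partition(':')
        name, value = name.strip().lower(), value.strip()
        if not sep or not name or not value:
            continue  # skip empty or malformed declarations
        important = value.lower().endswith('!important')
        if important:
            value = value[:-len('!important')].strip()
        parsed[name] = (value, important)
    return parsed


def resolve(parent_style, child_style):
    """Return the effective property values for the child element.

    Simplified model (an assumption): an !important parent value is only
    displaced by a child value that is also !important.
    """
    resolved = dict(parse_style(parent_style))
    for name, (value, important) in parse_style(child_style).items():
        parent = resolved.get(name)
        if parent is not None and parent[1] and not important:
            continue  # important parent value outranks a normal child value
        resolved[name] = (value, important)
    return {name: value for name, (value, _) in resolved.items()}


print(resolve('color: gray; font-size: 11pt',
              'color: red !important; font-size: 14pt !important'))
```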