### Install and Run Inscriptis Web Service

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Instructions for installing the web service extras and starting the Inscriptis API using uvicorn or the provided Docker image. Covers installation, server startup, and Docker usage.

```bash
# Install with web-service extras
pip install inscriptis[web-service]

# Start the server
uvicorn inscriptis.service.web:app --host 127.0.0.1 --port 5000
# or: inscriptis-api

# Docker
docker pull ghcr.io/weblyzard/inscriptis:latest
docker run -p 5000:5000 ghcr.io/weblyzard/inscriptis:latest
```

--------------------------------

### Install Inscriptis using easy_install

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Alternative installation method using easy_install if pip is not available.

```bash
$ easy_install inscriptis
```

--------------------------------

### Install Inscriptis Web Service

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to install the Inscriptis library with the optional web-service feature.

```bash
$ pip install inscriptis[web-service]
```

--------------------------------

### Install Inscriptis using pip

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Install the Inscriptis library using pip. This is the recommended method for most users.

```bash
$ pip install inscriptis
```

--------------------------------

### Start Inscriptis Web Service

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to start the Inscriptis web service using uvicorn.

```bash
$ uvicorn inscriptis.service.web:app --port 5000 --host 127.0.0.1
```

--------------------------------

### Print Fact Example

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

A simple Python print statement. No specific setup or constraints are mentioned.
```python
print(fact)
```

--------------------------------

### Example HTML for Annotation

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/README.rst

A sample HTML snippet used to demonstrate how annotation rules are applied.

```html
<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.
```

--------------------------------

### Inscriptis Annotation Profile JSON Example

Source: https://context7.com/weblyzard/inscriptis/llms.txt

An example JSON file defining annotation rules for the Inscriptis CLI. Shows how to map HTML elements and CSS selectors to annotation labels.

```json
{
  "h1": ["heading", "h1"],
  "h2": ["heading", "h2"],
  "b": ["emphasis"],
  "div#class=toc": ["table-of-contents"],
  "#class=FactBox": ["fact-box"],
  "#cite": ["citation"],
  "a#title": ["entity"]
}
```

--------------------------------

### JSONL Output Example with Annotations

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example of Inscriptis output in JSONL format, including extracted text and annotations for headings and emphasis.

```json
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}
```

--------------------------------

### Curl Request to Get Version

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example cURL command to check the version of the Inscriptis web service.

```bash
$ curl http://localhost:5000/version
```

--------------------------------

### Command-line Usage with Postprocessor

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example of using the Inscriptis command-line tool with a postprocessor to annotate content.

```bash
$ inscript https://www.fhgr.ch \
    -r ./examples/annotation/annotation-profile.json \
    -p surface
```

--------------------------------

### Inscriptis CLI Tool Usage

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Examples demonstrating the usage of the `inscript` command-line tool for various conversion and annotation tasks.
```APIDOC
## Basic Usage

### Convert URL to text
```bash
inscript https://en.wikipedia.org/wiki/Chur
```

### Convert local file to text and save
```bash
inscript page.html -o page.txt
```

## Advanced Options

### Strict indentation
```bash
inscript --indentation strict page.html -o page-strict.txt
```

### Show link targets inline
```bash
inscript -l https://example.com
```

### Show image alt captions and deduplicate
```bash
inscript -i -d page.html
```

### Annotate using JSON rules and output raw JSONL
```bash
inscript -r annotation-profile.json https://example.com
```

### Annotate and postprocess to XML
```bash
inscript -r annotation-profile.json -p xml https://example.com
```

### Annotate and postprocess to surface forms (JSON)
```bash
inscript -r annotation-profile.json -p surface https://example.com
```

### Annotate and postprocess to highlighted HTML
```bash
inscript -r annotation-profile.json -p html https://example.com -o annotated.html
```

### Convert from stdin
```bash
echo "<p>Hello world</p>" | inscript -o output.txt
```

### Custom table cell separator
```bash
inscript --table-cell-separator " | " page.html
```

## Example `annotation-profile.json`
```json
{
  "h1": ["heading", "h1"],
  "h2": ["heading", "h2"],
  "b": ["emphasis"],
  "div#class=toc": ["table-of-contents"],
  "#class=FactBox": ["fact-box"],
  "#cite": ["citation"],
  "a#title": ["entity"]
}
```
```

--------------------------------

### Python Hello World Program

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

A basic 'Hello, world!' program in Python. This example is used to show rendering differences.

```python
print('Hello, world!')
```

--------------------------------

### Python Loop and Print Example

Source: https://github.com/weblyzard/inscriptis/blob/master/tests/html/advanced-prefix-test.txt

Demonstrates a basic Python for loop with a cumulative sum and subsequent print statements.

```python
y = 0
for x in range(3, 10):
    print(x)
    y += x
print(y)
```

```python
print("Hallo")
print("Echo")
print("123")
```

--------------------------------

### HTML Snippet for Annotation Example

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

An example HTML snippet used to demonstrate how Inscriptis applies annotation rules.

```html
<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.
```

--------------------------------

### Inscript CLI Usage Examples

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Demonstrates various command-line interface commands for Inscriptis, including converting URLs, local files, applying strict indentation, showing link targets, handling images, annotating, and processing stdin. Covers common use cases and options.

```bash
# Convert a URL to text
inscript https://en.wikipedia.org/wiki/Chur

# Convert a local file, save output
inscript page.html -o page.txt

# Strict indentation (Firefox-like, no extra div/span padding)
inscript --indentation strict page.html -o page-strict.txt

# Show link targets inline: [link text](URL)
inscript -l https://example.com

# Show image alt captions, deduplicate repeated ones
inscript -i -d page.html

# Annotate using a JSON rules file, output raw JSONL
inscript -r annotation-profile.json https://example.com

# Annotate + postprocess to XML
inscript -r annotation-profile.json -p xml https://example.com

# Annotate + postprocess to surface forms (JSON)
inscript -r annotation-profile.json -p surface https://example.com

# Annotate + postprocess to highlighted HTML
inscript -r annotation-profile.json -p html https://example.com -o annotated.html

# Convert from stdin
echo "<p>Hello world</p>" | inscript -o output.txt

# Custom table cell separator
inscript --table-cell-separator " | " page.html
```

--------------------------------

### JSON Output Example for Annotations

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Illustrates the expected JSON structure for text and annotations returned by `get_annotated_text()`.

```json
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [6, 10, "emphasis"]]}
```

--------------------------------

### Annotated Text Output (JSONL)

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example of JSON Lines (JSONL) output from Inscriptis when processing annotated HTML. It includes the extracted text and a list of labels with their start and end indices.

```json
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}
```

--------------------------------

### Curl Request to Get Text

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example cURL command to send an HTML file to the Inscriptis web service and retrieve plain text.

```bash
$ curl -X POST -H "Content-Type: text/html; encoding=UTF8" \
    --data-binary @test.html http://localhost:5000/get_text
```

--------------------------------

### XML Output with Annotations

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example XML output from Inscriptis when using the 'xml' postprocessor, including annotated text.

```xml
<heading>Chur</heading>

<emphasis>Chur</emphasis> is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.
```

--------------------------------

### Python Code Rendering (lynx)

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

Python code examples as rendered by lynx, showing differences in whitespace handling compared to inscriptis.

```python
Python programming examples[edit]

Hello world program:
print('Hello, world!')

Program to calculate the factorial of a positive integer:
n = int(input('Type a number, and its factorial will be printed: '))

if n < 0:
    raise ValueError('You must enter a positive integer')

fact = 1
i = 2
while i <= n:
    fact *= i
    i += 1
```

--------------------------------

### Annotation Rules and Metadata Extraction Example

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/benchmarking.md

Example JSON structure for annotation rules and extracted metadata, used in Inscriptis 2.0 and later for specific test cases.

```json
{
  "annotation_rules": {
    "h1": ["heading"],
    "b": ["emphasis"]
  },
  "result": [
    ["heading", "The first"],
    ["heading", "The second"],
    ["heading", "Subheading"]
  ]
}
```

--------------------------------

### HTML to Text Conversion Example

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Demonstrates the difference in conversion quality between Inscriptis and Beautiful Soup for HTML enumerations. Inscriptis provides a more accurate, layout-aware output.

```html
<ul>
  <li>first</li>
  <li>second</li>
</ul>
Chur

Chur is the capital and largest town of the Swiss canton of the Grisons.

list[Annotation]

### Description
Shift annotations based on the given line’s formatting. Adjusts the start and end indices of annotations based on the line’s formatting and width.

### Parameters
* **annotations** (list[Annotation]) - A list of Annotations.
* **content_width** (int) - The width of the actual content.
* **line_width** (int) - The width of the line in which the content is placed.
* **align** (HorizontalAlignment) - The horizontal alignment (left, right, center) to assume for the adjustment.
* **shift** (int, optional) - An optional additional shift. Defaults to 0.

### Returns
A list of [Annotation](#inscriptis.annotation.Annotation)s with the adjusted start and end positions.
```

--------------------------------

### Benchmarking Script Configuration

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/benchmarking.md

Configuration options for the Inscriptis benchmarking script, allowing selection of HTML-to-text algorithms to be executed.

```python
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
```

--------------------------------

### Inscript Command Line Usage

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Overview of the command-line parameters for the inscript client, used for converting HTML to text.

```bash
usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a]
                [-r ANNOTATION_RULES] [-p POSTPROCESSOR]
                [--indentation INDENTATION]
                [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                [input]

Convert the given HTML document to text.

positional arguments:
  input                 Html input either from a file or a URL
                        (default:stdin).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Input encoding to use (default:utf-8 for files;
                        detected server encoding for Web URLs).
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                        Path to an optional JSON file containing rules
                        for annotating the retrieved text.
  -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                        Optional component for postprocessing the result
                        (html, surface, xml).
  --indentation INDENTATION
                        How to handle indentation (extended or strict;
                        default: extended).
  --table-cell-separator TABLE_CELL_SEPARATOR
                        Separator to use between table cells (default:
                        three spaces).
  -v, --version         display version information
```

--------------------------------

### Annotation Postprocessors with Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Shows how to use different annotation postprocessors (SurfaceExtractor, XmlExtractor, HtmlExtractor) to transform raw annotated text into various formats. Requires importing specific extractors and providing annotation rules.

```python
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig
from inscriptis.annotation.output.surface import SurfaceExtractor
from inscriptis.annotation.output.xml import XmlExtractor
from inscriptis.annotation.output.html import HtmlExtractor

html = "<h1>Chur</h1><p>Chur is the capital of Grisons.</p>"
```

--------------------------------

```python
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.html_properties import Display, WhiteSpace
from inscriptis.model.config import ParserConfig
from inscriptis.model.html_element import HtmlElement

# Assumption: the rules below customize a copy of the built-in strict profile
css = CSS_PROFILES["strict"].copy()

# Render <h1> with surrounding "===" markers
css["h1"] = HtmlElement(
    display=Display.block,
    prefix="=== ",
    suffix=" ===",
    margin_before=1,
    margin_after=1,
)

# Render <aside> content as indented block
css["aside"] = HtmlElement(
    display=Display.block,
    padding_inline=6,
    margin_before=1,
    margin_after=1,
)

# Hide
# <nav> elements entirely
css["nav"] = HtmlElement(display=Display.none)

# Preserve whitespace inside <code> like a <pre> block
css["code"] = HtmlElement(
    display=Display.inline,
    whitespace=WhiteSpace.pre,
)

html = "<h1>Title</h1><nav>skip me</nav><aside>Side note</aside>"
print(get_text(html, ParserConfig(css=css)))
```

--------------------------------

### Convert HTML URL to Text via Command Line

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Convert an HTML page from a given URL to plain text and output it directly to the console using the inscript command-line client.

```bash
$ inscript https://www.fhgr.ch
```

--------------------------------

### Low-level HTML Parsing with Inscriptis Engine

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Demonstrates using the `Inscriptis` class for low-level HTML parsing when you manage the lxml HTML tree yourself. It accepts a pre-parsed tree and a `ParserConfig` to extract text or annotations.

```python
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis.model.config import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES

html = "<h2>Section</h2><p>First <b>paragraph</b>.</p>"
html_tree = fromstring(html)
config = ParserConfig(css=CSS_PROFILES["strict"])
parser = Inscriptis(html_tree, config)

# Get plain text
text = parser.get_text()
print(text)
```

```python
# Get raw Annotation objects (requires annotation_rules in config)
from inscriptis.model.config import ParserConfig

config_ann = ParserConfig(annotation_rules={"h2": ["section"], "b": ["bold"]})
parser2 = Inscriptis(fromstring(html), config_ann)
_ = parser2.get_text()
for ann in parser2.get_annotations():
    print(ann)  # Annotation(start=0, end=7, metadata='section'), etc.
```

--------------------------------

### Configure Parser with Strict CSS and Disabled Links

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Use this configuration to enable the strict CSS profile and prevent links from being displayed in the output.
Ensure necessary imports are included.

```Python
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig

css_profile = CSS_PROFILES['strict'].copy()
config = ParserConfig(css=css_profile, display_links=False)
text = get_text('first link', config)
print(text)
```

--------------------------------

### Convert HTML to Plain Text with Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Use `get_text` to convert HTML to plain text. It handles complex layouts like tables and lists. You can customize the output using `ParserConfig` for CSS profiles, link display, and more.

```python
import urllib.request
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig

# Basic usage: convert a live web page
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/Chur").read().decode("utf-8")
text = get_text(html)
print(text[:500])

# With custom config: strict (Firefox-like) CSS, show links, show image alt text
config = ParserConfig(
    css=CSS_PROFILES["strict"],
    display_links=True,
    display_images=True,
    deduplicate_captions=True,
    table_cell_separator=" ",
)
text = get_text(html, config)

# From a local string: handles nested tables, lists, pre blocks
html_snippet = """
<h1>Cities</h1>
<ul><li>Chur</li><li>Zurich</li></ul>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Chur</td><td>38,000</td></tr>
</table>
"""
print(get_text(html_snippet))
```

--------------------------------

### Docker Run Command

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to run the Inscriptis Docker container, naming it `inscriptis` and publishing the service port 5000.

```bash
$ docker run --name inscriptis -p 5000:5000 ghcr.io/weblyzard/inscriptis:latest
```

--------------------------------

### Convert HTML File to Text and Save to File

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Convert an HTML file to plain text and save the output to a specified file using the inscript command-line client.
```bash
$ inscript fhgr.html -o fhgr.txt
```

--------------------------------

### Lynx Nested Table Rendering

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

This output illustrates how the Lynx text-based browser renders the same nested table structure. It provides a comparison point for Inscriptis's text-based rendering.

```text
Single First red green blue red green Second blue red green blue Nested red green blue red green blue red green blue blue red green blue blue red green blue red green blue red green blue red green blue red green blue red green blue blue red green blue
```

--------------------------------

### Optimize Memory Consumption on Unix Systems

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Manually force lxml to release allocated memory on Unix systems to mitigate increased memory consumption in memory-intensive web services.

```python
import ctypes


def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
```

--------------------------------

### Transform HTML Tree to Text

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Demonstrates how to use the `Inscriptis` class to parse an lxml HTML tree and extract its text representation.

```python
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis

html_content = "Test"

# create an HTML tree from the HTML content.
html_tree = fromstring(html_content)

# transform the HTML tree to text.
parser = Inscriptis(html_tree)
text = parser.get_text()
```

--------------------------------

### Table Class Methods

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Methods for managing HTML tables, including adding rows and cells.

```APIDOC
## Table Class Methods

### `add_cell(table_cell: TableCell)` method
Add a new TableCell to the table’s last row.

* **NOTE:** If no row exists yet, a new row is created.

### `add_row()` method
Add an empty TableRow to the table.
```

--------------------------------

### Inspect and Customize CSS Profiles in Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Shows how to access, inspect, and derive custom CSS profiles from Inscriptis's built-in profiles ('strict' and 'relaxed'). Useful for fine-tuning text extraction by modifying tag rendering properties.

```python
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES, STRICT_CSS_PROFILE, RELAXED_CSS_PROFILE
from inscriptis.html_properties import Display, WhiteSpace
from inscriptis.model.config import ParserConfig
from inscriptis.model.html_element import HtmlElement

# Inspect available profiles
print(list(CSS_PROFILES.keys()))  # ['strict', 'relaxed']

# Use the strict profile directly
strict = CSS_PROFILES["strict"]
print(strict["p"])  # the HtmlElement describing how <p> is rendered

# Derive a custom profile from the relaxed baseline
custom = CSS_PROFILES["relaxed"].copy()
custom["blockquote"] = HtmlElement(
    display=Display.block,
    prefix="> ",
    margin_before=1,
    margin_after=1,
)
custom["code"] = HtmlElement(display=Display.inline, prefix="`", suffix="`")

html = "<blockquote>Famous quote</blockquote> and <code>print()</code>"
print(get_text(html, ParserConfig(css=custom)))
```

--------------------------------

### Surface postprocessor output format

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/README.rst

The 'surface' postprocessor outputs a list of mappings between annotation surface forms and their labels.

```text
[
   ['heading', 'Chur'],
   ['emphasis', 'Chur']
]
```

--------------------------------

### Convert HTML from Stdin to Text File

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Convert HTML content provided via standard input (stdin) to plain text and save the output to a file. This is useful for piping HTML data to the inscript client.

```bash
$ echo "Make it so!" \
| inscript -o output.txt
```

--------------------------------

### Parse HTML with Custom Link and Anchor Display

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Configure Inscriptis to display inline links and anchor URLs. Uses the strict CSS rendering profile. The variable `html` is expected to hold the HTML document as a string.

```python
from lxml.html import fromstring
from inscriptis import Inscriptis
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig

# use the strict CSS rendering profile and fine-tune link handling
css = CSS_PROFILES['strict']
config = ParserConfig(css=css, display_links=True, display_anchors=True)

html_tree = fromstring(html)
parser = Inscriptis(html_tree, config)
text = parser.get_text()
```

--------------------------------

### Inscriptis with Strict CSS Profile for Whitespace Handling

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Python code demonstrating how to configure Inscriptis to use the 'strict' CSS profile for Firefox-like whitespace handling. The variable `html` is expected to hold the HTML document as a string.

```python
from lxml.html import fromstring
from inscriptis import Inscriptis
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig

# create a ParserConfig that uses the strict CSS rendering profile
css = CSS_PROFILES['strict']
config = ParserConfig(css=css)

html_tree = fromstring(html)
parser = Inscriptis(html_tree, config)
text = parser.get_text()
```
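
--------------------------------

### Derive Surface Forms from a JSONL Annotation Record (sketch)

A minimal sketch, assuming the JSONL record layout shown in the annotated output examples above: a `text` field plus `label` entries of the form `[start, end, label]`. It uses only the Python standard library and slices the text by each label's offsets, which yields pairs shaped like the 'surface' postprocessor output. It is not Inscriptis's own implementation, and `surface_forms` is a hypothetical helper name.

```python
import json

# One JSONL record, copied from the annotated output example above.
line = r'{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}'


def surface_forms(record):
    """Return [label, surface form] pairs (hypothetical helper, not an Inscriptis API)."""
    text = record["text"]
    # Each label entry is [start, end, label]; the offsets index into `text`.
    return [[label, text[start:end]] for start, end, label in record["label"]]


record = json.loads(line)
forms = surface_forms(record)
for label, surface in forms:
    print(f"{label}: {surface}")
```

Because the offsets refer to the extracted text rather than the HTML, the same slicing should work for any record that follows this layout, regardless of which annotation rules produced it.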