### Install and Run Inscriptis Web Service

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Instructions for installing the web service extras and starting the Inscriptis API using uvicorn or the provided Docker image. Covers installation, server startup, and Docker usage.

```bash
# Install with web-service extras
pip install inscriptis[web-service]

# Start the server
uvicorn inscriptis.service.web:app --host 127.0.0.1 --port 5000
# or: inscriptis-api

# Docker
docker pull ghcr.io/weblyzard/inscriptis:latest
docker run -p 5000:5000 ghcr.io/weblyzard/inscriptis:latest
```

--------------------------------

### Install Inscriptis using easy_install

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Alternative installation method using easy_install if pip is not available.

```bash
$ easy_install inscriptis
```

--------------------------------

### Install Inscriptis Web Service

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to install the Inscriptis library with the optional web-service feature.

```bash
$ pip install inscriptis[web-service]
```

--------------------------------

### Install Inscriptis using pip

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Install the Inscriptis library using pip. This is the recommended method for most users.

```bash
$ pip install inscriptis
```

--------------------------------

### Start Inscriptis Web Service

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to start the Inscriptis web service using uvicorn.

```bash
$ uvicorn inscriptis.service.web:app --port 5000 --host 127.0.0.1
```

--------------------------------

### Print Fact Example

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

A simple Python print statement. No specific setup or constraints are mentioned.
```python
print(fact)
```

--------------------------------

### Example HTML for Annotation

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/README.rst

A sample HTML snippet used to demonstrate how annotation rules are applied.

```html
<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.
```

```bash
…Hello world" | inscript -o output.txt
```

### Custom table cell separator

```bash
inscript --table-cell-separator " | " page.html
```

## Example `annotation-profile.json`

```json
{
  "h1": ["heading", "h1"],
  "h2": ["heading", "h2"],
  "b": ["emphasis"],
  "div#class=toc": ["table-of-contents"],
  "#class=FactBox": ["fact-box"],
  "#cite": ["citation"],
  "a#title": ["entity"]
}
```

--------------------------------

### Python Hello World Program

Source: https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md

A basic 'Hello, world!' program in Python. This example is used to show rendering differences.

```python
print('Hello, world!')
```

--------------------------------

### Python Loop and Print Example

Source: https://github.com/weblyzard/inscriptis/blob/master/tests/html/advanced-prefix-test.txt

Demonstrates a basic Python for loop with a cumulative sum and subsequent print statements.

```python
y = 0
for x in range(3, 10):
    print(x)
    y += x
print(y)
```

```python
print("Hallo")
print("Echo")
print("123")
```

--------------------------------

### HTML Snippet for Annotation Example

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

An example HTML snippet used to demonstrate how Inscriptis applies annotation rules.

```html
<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.
```

```bash
…Hello world" | inscript -o output.txt

# Custom table cell separator
inscript --table-cell-separator " | " page.html
```

--------------------------------

### JSON Output Example for Annotations

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Illustrates the expected JSON structure for text and annotations returned by `get_annotated_text()`. Each label is a `[start, end, name]` triple of character offsets into `text`.

```json
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [6, 10, "emphasis"]]}
```

--------------------------------

### Annotated Text Output (JSONL)

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example of JSON Lines (JSONL) output from Inscriptis when processing annotated HTML. It includes the extracted text and a list of labels with their start and end indices.

```json
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}
```

--------------------------------

### Curl Request to Get Text

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example cURL command to send an HTML file to the Inscriptis web service and retrieve plain text.

```bash
$ curl -X POST -H "Content-Type: text/html; encoding=UTF8" \
       --data-binary @test.html http://localhost:5000/get_text
```

--------------------------------

### XML Output with Annotations

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Example XML output from Inscriptis when using the 'xml' postprocessor, including annotated text.

```xml
Chur is the capital and largest town of the Swiss canton of the Grisons.
"""
rules = {
    "h1": ["heading", "h1"],  # tag rule
    "b": ["emphasis"],        # tag rule
    "a#title": ["entity"],    # attribute rule
…: list[Annotation]

### Description
Shift annotations based on the given line’s formatting. Adjusts the start and end indices of annotations based on the line’s formatting and width.

### Parameters
* **annotations** (list[Annotation]) - A list of Annotations.
* **content_width** (int) - The width of the actual content.
* **line_width** (int) - The width of the line in which the content is placed.
* **align** (HorizontalAlignment) - The horizontal alignment (left, right, center) to assume for the adjustment.
* **shift** (int, optional) - An optional additional shift. Defaults to 0.

### Returns
A list of Annotations with the adjusted start and end positions.
```

--------------------------------

### Benchmarking Script Configuration

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/benchmarking.md

Configuration options for the Inscriptis benchmarking script, allowing selection of HTML-to-text algorithms to be executed.

```python
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
```

--------------------------------

### Inscript Command Line Usage

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Overview of the command-line parameters for the inscript client, used for converting HTML to text.

```bash
usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a]
                [-r ANNOTATION_RULES] [-p POSTPROCESSOR]
                [--indentation INDENTATION]
                [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                [input]

Convert the given HTML document to text.

positional arguments:
  input                 Html input either from a file or a URL
                        (default:stdin).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Input encoding to use (default:utf-8 for
                        files; detected server encoding for Web URLs).
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                        Path to an optional JSON file containing rules
                        for annotating the retrieved text.
  -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                        Optional component for postprocessing the
                        result (html, surface, xml).
  --indentation INDENTATION
                        How to handle indentation (extended or strict;
                        default: extended).
  --table-cell-separator TABLE_CELL_SEPARATOR
                        Separator to use between table cells (default:
                        three spaces).
  -v, --version         display version information
```

--------------------------------

### Annotation Postprocessors with Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Shows how to use the annotation postprocessors (SurfaceExtractor, XmlExtractor, HtmlExtractor) to transform raw annotated text into other formats. Requires importing the specific extractors and providing annotation rules.

```python
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig
from inscriptis.annotation.output.surface import SurfaceExtractor
from inscriptis.annotation.output.xml import XmlExtractor
from inscriptis.annotation.output.html import HtmlExtractor

# Sample input containing the two tags the rules below refer to.
html = "<h1>Chur</h1><p><b>Chur</b> is the capital of Grisons.</p>"

rules = {"h1": ["heading"], "b": ["emphasis"]}
config = ParserConfig(annotation_rules=rules)
annotated = get_annotated_text(html, config)

# --- SurfaceExtractor: adds 'surface' key with (label, text) pairs ---
surface_result = SurfaceExtractor()(annotated.copy())
print(surface_result["surface"])

# --- XmlExtractor: returns XML string with annotation tags ---
xml_result = XmlExtractor()(annotated.copy())
print(xml_result)

# --- HtmlExtractor: returns color-highlighted HTML ---
html_result = HtmlExtractor()(annotated.copy())
# Returns full HTML page with highlights and injected CSS
print(html_result[:200])
```

--------------------------------

### Annotate HTML from Web Page using CLI

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/README.rst

Use the inscript CLI with an annotation rules file to convert and annotate HTML content from a web page.

```bash
$ inscript https://www.fhgr.ch -r annotation-profile.json
```

--------------------------------

### Using Inscriptis Annotation Named Tuple

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Demonstrates how to create an `Annotation` named tuple and access its fields. This is useful for understanding the structure of individual annotations.

```python
from inscriptis.annotation import Annotation

# Annotation is a NamedTuple: (start, end, metadata)
ann = Annotation(start=0, end=4, metadata="heading")
print(ann.start, ann.end, ann.metadata)  # 0 4 heading
```

--------------------------------

### Ignore Elements During Parsing

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Prevent specific HTML elements from appearing in the parsed text by setting their display property to `Display.none`. This example removes `form` elements.
```python
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
from inscriptis.html_properties import Display
from inscriptis.model.config import ParserConfig

# Create a custom CSS based on the strict profile and hide `form` elements.
css = CSS_PROFILES['strict'].copy()
css['form'] = HtmlElement(display=Display.none)

# Sample input; the form markup is illustrative.
html = """First line.
<form>
  User data <input type="text" name="name">
</form>"""

# Create a parser configuration using the custom CSS.
config = ParserConfig(css=css)
text = get_text(html, config)
print(text)
```

--------------------------------

### Override Default CSS Definitions for Rendering

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Customize the rendering of HTML tags by overriding default CSS definitions. This example modifies `div` and `span` elements.

```python
from lxml.html import fromstring

from inscriptis import Inscriptis
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.html_properties import Display
from inscriptis.model.config import ParserConfig
from inscriptis.model.html_element import HtmlElement

# Create a custom CSS based on the strict profile and change the
# rendering of `div` and `span` elements.
css = CSS_PROFILES['strict'].copy()
css['div'] = HtmlElement(display=Display.block, padding_inline=2)
css['span'] = HtmlElement(prefix=' ', suffix=' ')

# Sample input (illustrative); Inscriptis expects a parsed lxml tree.
html = "<div>First block.</div><span>inline</span> text"
html_tree = fromstring(html)

# Create a parser using the custom CSS.
config = ParserConfig(css=css)
parser = Inscriptis(html_tree, config)
text = parser.get_text()
print(text)
```

--------------------------------

### Command-line Usage with HTML Postprocessor

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Command to convert a Wikipedia page to annotated HTML using the specified annotation rules.
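The command in this section reads its annotation rules from `./wikipedia.json`. As a minimal sketch, such a file could be produced from Python as shown here. The rule set and the `write_rules` helper are hypothetical names chosen for illustration; the selectors follow the `annotation-profile.json` syntax shown elsewhere in this document, and the project's own `wikipedia.json` may well differ.

```python
import json
from pathlib import Path

# Hypothetical rule set in the selector syntax of annotation-profile.json;
# this is NOT the project's actual wikipedia.json.
WIKIPEDIA_RULES = {
    "h1": ["heading"],
    "h2": ["heading"],
    "b": ["emphasis"],
    "table": ["table"],
}


def write_rules(path, rules):
    """Check that every selector maps to a list of label strings,
    write the rules as JSON, and return the path that was written."""
    for selector, labels in rules.items():
        is_label_list = isinstance(labels, list) and all(
            isinstance(label, str) for label in labels
        )
        if not is_label_list:
            raise ValueError(f"labels for {selector!r} must be a list of strings")
    target = Path(path)
    target.write_text(json.dumps(rules, indent=2), encoding="utf-8")
    return target
```

Calling `write_rules("wikipedia.json", WIKIPEDIA_RULES)` in the working directory would create a file that the `--annotation-rules` option can point to.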
```bash
inscript --annotation-rules ./wikipedia.json \
         --postprocessor html \
         https://en.wikipedia.org/wiki/Chur
```

--------------------------------

### Programmatic Use of Inscriptis for Annotated Text

Source: https://github.com/weblyzard/inscriptis/blob/master/README.rst

Python code demonstrating how to use Inscriptis to extract annotated text from a URL, including custom annotation rules.

```python
import urllib.request

from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

rules = {'h1': ['heading', 'h1'],
         'h2': ['heading', 'h2'],
         'b': ['emphasis'],
         'table': ['table']
        }

output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])
```

--------------------------------

### Shift Annotations Horizontally

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Adjust the start and end indices of annotations based on line formatting, content width, and alignment. This is useful for adapting annotations after text reformatting.

```python
inscriptis.annotation.horizontal_shift(
    annotations: list[Annotation],
    content_width: int,
    line_width: int,
    align: HorizontalAlignment,
    shift: int = 0,
) -> list[Annotation]
```

--------------------------------

### inscriptis.model.canvas.block.Block.merge_normal_text

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Merges the given text with the current block, handling normal text. If the previous text ended with a whitespace and the new text starts with one, they will collapse into a single whitespace.

```APIDOC
## merge_normal_text(text: str) -> None

### Description
Merge the given text with the current block.

### Parameters
- **text** (str) - Required - the text to merge

### NOTE
If the previous text ended with a whitespace and text starts with one, both will automatically collapse into a single whitespace.
```

--------------------------------

### Prefix Class Methods

Source: https://github.com/weblyzard/inscriptis/blob/master/docs/api.md

Methods for managing prefixes used in text formatting, such as registering, popping, and removing prefixes.

```APIDOC
## Prefix Class Methods

### `first` property
Return the prefix used at the beginning of a tag.

### `pop_next_bullet()` method
Pop the next bullet to use, if any bullet is available.

### `register_prefix(padding_inline: int, bullet: str)` method
Register the given prefix.

* **Parameters:**
  * **padding_inline** – the number of characters used for padding_inline
  * **bullet** – an optional bullet.

### `remove_last_prefix()` method
Remove the last prefix from the list.

### `rest` property
Return the prefix used for new lines within a block.

### `unconsumed_bullet` property
Yield any yet unconsumed bullet.
```

--------------------------------

### Custom HTML Tag Handling in Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Demonstrates how to define custom handlers for HTML tags like `<b>` and `<em>` to control their output format. Requires defining handler functions and a CustomHtmlTagHandlerMapping.
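The mapping described above is, in essence, two dictionaries keyed by tag name: one consulted when a tag opens, one when it closes. The following stdlib-only sketch illustrates that dispatch pattern with `html.parser`. It does not use Inscriptis and says nothing about its internals; `MarkerParser`, `START_HANDLERS`, `END_HANDLERS`, and `render` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical handler tables: tag name -> callable that appends output.
START_HANDLERS = {
    "b": lambda out, attrs: out.append("**"),
    "em": lambda out, attrs: out.append("_"),
}
END_HANDLERS = {
    "b": lambda out: out.append("**"),
    "em": lambda out: out.append("_"),
}


class MarkerParser(HTMLParser):
    """Collect text and call a handler whenever a mapped tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        handler = START_HANDLERS.get(tag)
        if handler is not None:
            handler(self.out, dict(attrs))

    def handle_endtag(self, tag):
        handler = END_HANDLERS.get(tag)
        if handler is not None:
            handler(self.out)

    def handle_data(self, data):
        self.out.append(data)


def render(html):
    """Feed the HTML through the parser and join the collected pieces."""
    parser = MarkerParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(render("Welcome to <b>Chur</b>, the <em>oldest</em> city."))
# prints: Welcome to **Chur**, the _oldest_ city.
```

Tags without an entry in either table contribute no markers, so unmapped markup simply falls through to its text content.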
```python
from lxml.html import fromstring

from inscriptis import Inscriptis
from inscriptis.model.config import ParserConfig
from inscriptis.model.html_document_state import HtmlDocumentState
from inscriptis.model.tag import CustomHtmlTagHandlerMapping


def handle_start_b(state: HtmlDocumentState, attrs: dict) -> None:
    state.tags[-1].write("**")


def handle_end_b(state: HtmlDocumentState) -> None:
    state.tags[-1].write("**")


def handle_start_em(state: HtmlDocumentState, attrs: dict) -> None:
    state.tags[-1].write("_")


def handle_end_em(state: HtmlDocumentState) -> None:
    state.tags[-1].write("_")


mapping = CustomHtmlTagHandlerMapping(
    start_tag_mapping={"b": handle_start_b, "em": handle_start_em},
    end_tag_mapping={"b": handle_end_b, "em": handle_end_em},
)
config = ParserConfig(custom_html_tag_handler_mapping=mapping)

# Sample input (illustrative) using the two tags handled above.
html = "Welcome to <b>Chur</b>, the <em>oldest</em> city."

# Inscriptis expects a parsed lxml tree rather than a raw string.
parser = Inscriptis(fromstring(html), config)
print(parser.get_text())
```

--------------------------------

### Configure HtmlElement Properties for Tag Rendering in Inscriptis

Source: https://context7.com/weblyzard/inscriptis/llms.txt

Illustrates how to define custom rendering properties for HTML tags using `HtmlElement`. This allows control over display mode, margins, padding, prefixes, suffixes, whitespace handling, and alignment for specific tags.

```python
from inscriptis.html_properties import Display, WhiteSpace, HorizontalAlignment, VerticalAlignment
from inscriptis.model.html_element import HtmlElement
from inscriptis.model.config import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis import get_text

css = CSS_PROFILES["strict"].copy()

# Render