### ftfy Negative Examples and Manual Fixes

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Shows text that ftfy leaves unchanged because it is not considered mojibake. It also attempts a fix by hand, re-decoding each example through `sloppy-windows-1252` and falling back to MacRoman, since ftfy's automatic correction does not apply here.

```python
import ftfy  # importing ftfy also registers the 'sloppy-windows-1252' codec

NEGATIVE_EXAMPLES = [
    "Con il corpo e lo spirito ammaccato,\u00a0è come se nel cuore avessi un vetro conficcato.",
    "2012—∞",
    "TEM QUE SEGUIR, SDV SÓ…",
    "Join ZZAJÉ’s Official Fan List",
    "(-1/2)! = √π",
    "OK??:( `¬´ ):"
]

for example in NEGATIVE_EXAMPLES:
    # ftfy doesn't "fix" these because they're not broken, but we can manually try fixes
    try:
        print(example.encode('sloppy-windows-1252').decode('utf-8'))
    except UnicodeError:
        print(example.encode('macroman').decode('utf-8'))
    assert ftfy.fix_encoding(example) == example
```

--------------------------------

### Fix Text Examples

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Illustrates the core functionality of ftfy's `fix_text` function with real-world examples of mojibake, including nested encoding errors, curly quotes, and non-breaking spaces that were converted to plain spaces.
```python
import ftfy

# Basic mojibake
print(ftfy.fix_text('âœ” No problems'))

# Multiple layers of mojibake
print(ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.'))

# Mojibake with curly quotes
print(ftfy.fix_text("l’humanitÃ©"))

# Mojibake with non-breaking spaces
print(ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion'))
print(ftfy.fix_text('Ã perturber la rÃ©flexion'))

# HTML entities
print(ftfy.fix_text('P&EACUTE;REZ'))

# Unchanged text (avoids false positives)
print(ftfy.fix_text('IL Y MARQUÉ…'))
```

--------------------------------

### Chained Encoding/Decoding Errors

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Shows how mojibake is created and then made worse when text is repeatedly encoded as UTF-8 and decoded as Windows-1252.

```python
text = "l’Hôpital"
print(text.encode('utf-8').decode('windows-1252').encode('utf-8').decode('windows-1252'))
```

--------------------------------

### Fix Encoding Example

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Demonstrates how to use the `fix_encoding` function from the ftfy library to repair text that was decoded with the wrong encoding.

```python
from ftfy import fix_encoding

print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
```

--------------------------------

### Avoid False Positives

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Highlights ftfy's heuristic approach to avoid altering text that is already correct. The example text could be mistaken for mojibake but is intentionally left unchanged.

```python
import ftfy

ftfy.fix_text('IL Y MARQUÉ…')
```

--------------------------------

### Fixing MacRoman Mojibake with ftfy

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

This example shows how ftfy can correct text that has been garbled by a MacRoman decoding error.
It first encodes the string as MacRoman and decodes the result as UTF-8, which undoes one layer of the mojibake by hand, and then lets ftfy repair the layers that remain.

```python
from ftfy import fix_and_explain

EXAMPLES = [
    "Merci de t‚Äö√†√∂¬¨¬©l‚Äö√†√∂¬¨¬©charger le plug-in"
]

# Undo one layer of MacRoman mojibake by hand
partially_decoded = EXAMPLES[0].encode('macroman').decode('utf-8')

# Let ftfy fix what remains
fixed_text = fix_and_explain(partially_decoded)[0]

print(f"After one manual MacRoman step: {partially_decoded}")
print(f"Fixed by ftfy: {fixed_text}")
```

--------------------------------

### ftfy Module Documentation

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/heuristic.rst

Provides an overview of the `ftfy` library's modules and functions related to mojibake detection and fixing.

```APIDOC
Module: ftfy
Description: Fixes mojibake in text.

Submodules:
  - ftfy.badness: Contains heuristics for detecting mojibake.
    - Functions:
      - badness(text: str) -> int: Counts the unlikely character sequences in the given text.
      - is_bad(text: str) -> bool: Returns True if the text is considered 'bad' (mojibake).

  - ftfy.chardata: Contains regular expressions for mojibake detection.
    - Constants:
      - UTF8_DETECTOR_RE: Regex for detecting specific UTF-8 decoding errors.
      - LOSSY_UTF8_RE: Regex for detecting lossy UTF-8 sequences (with replacements).

  - ftfy.fixes: Contains functions to fix various types of mojibake.
    - Functions:
      - decode_inconsistent_utf8(text: str) -> str: Fixes text with inconsistent UTF-8 decoding.
      - replace_lossy_sequences(byts: bytes) -> bytes: Replaces lossy UTF-8 sequences with the replacement character.
```

--------------------------------

### Highlighting Mojibake Matches with ftfy

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Illustrates how ftfy identifies potential mojibake using a regular expression. It shows how to find these matches and manually highlight them in a string, demonstrating the 'badness' metric.
```python
import blessings
import ftfy
import ftfy.badness

text = "Ã perturber la rÃ©flexion des thÃ©ologiens jusqu'Ã nos jours"

# We want to highlight the matches to this regular expression:
ftfy.badness.BADNESS_RE.findall(text)

# We'll just highlight it manually:
term = blessings.Terminal()
highlighted_text = (
    term.on_yellow("Ã ") + "perturber la r" + term.on_yellow("Ã©")
    + "flexion des th" + term.on_yellow("Ã©") + "ologiens jusqu'Ã nos jours"
)

# Highlighted text shows matches for the 'badness' expression.
# If we've confirmed from them that this is mojibake, and there's a consistent fix, we
# can fix even text in contexts that were too unclear for the regex, such as the final Ã.
print(highlighted_text)
print(ftfy.fix_text(highlighted_text))
```

--------------------------------

### ftfy Command-line Usage

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/cli.rst

Usage documentation for the `ftfy` command-line tool. It lists the arguments and options for processing text files: input and output files, encoding selection or guessing, Unicode normalization, and HTML entity preservation.

```text
usage: ftfy [-h] [-o OUTPUT] [-g] [-e ENCODING] [-n NORMALIZATION]
            [--preserve-entities]
            [filename]

ftfy (fixes text for you), version 6.0

positional arguments:
  filename              The file whose Unicode is to be fixed. Defaults to -,
                        meaning standard input.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        The file to output to. Defaults to -, meaning standard
                        output.
  -g, --guess           Ask ftfy to guess the encoding of your input. This is
                        risky. Overrides -e.
  -e ENCODING, --encoding ENCODING
                        The encoding of the input. Defaults to UTF-8.
  -n NORMALIZATION, --normalization NORMALIZATION
                        The normalization of Unicode to apply. Defaults to
                        NFC. Can be "none".
  --preserve-entities   Leave HTML entities as they are. The default is to
                        decode them, as long as no HTML tags have appeared in
                        the file.
```

--------------------------------

### Fix Text and Explain Mojibake

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

Demonstrates how to use `fix_and_explain` to correct mojibake in text and obtain a step-by-step explanation of the transformations applied. This is useful for understanding encoding issues.

```python
from ftfy import fix_and_explain, apply_plan

shipping_label = "L&AMP;ATILDE;&AMP;SUP3;PEZ"
fixed, explanation = fix_and_explain(shipping_label)
print(fixed)
# Output: LóPEZ
print(explanation)
# Output: [('apply', 'unescape_html'), ('apply', 'unescape_html'), ('apply', 'unescape_html'), ('encode', 'latin-1'), ('decode', 'utf-8')]

label2 = "CARR&AMP;ATILDE;&AMP;COPY;"
print(apply_plan(label2, explanation))
# Output: CARRé
```

--------------------------------

### ftfy License Requirements

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Outlines the core requirements of the Apache 2.0 license for using and distributing ftfy, including attribution.

```APIDOC
If you use or distribute ftfy, you must follow the terms of the
[Apache license](https://www.apache.org/licenses/LICENSE-2.0), including
that you must attribute the author of ftfy (Robyn Speer) correctly.
```

--------------------------------

### Opening Files with UTF-8 Encoding in Python

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/avoid.rst

Demonstrates how to open files in Python 3 using UTF-8 encoding and specifying error handling. This is the recommended approach for most text files.

```python
openfile = open(filename, encoding='utf-8', errors='replace')
```

--------------------------------

### Encode and Decode Text

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Demonstrates mojibake by encoding a string in Windows-1252 and then decoding it with MacRoman.
```python
phrase = "Plus ça change, plus c’est la même chose"
phrase.encode('windows-1252').decode('macroman')
```

--------------------------------

### Import and Version Check

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Imports the ftfy library and accesses its version attribute.

```python
import ftfy
ftfy.__version__
```

--------------------------------

### Show CP437 Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Displays the CP437 character table.

```python
show_char_table('cp437')
```

--------------------------------

### ftfy Configuration Options

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/config.rst

This section outlines various configuration options for the ftfy library, allowing users to customize text fixing behavior. Options include disabling HTML unescaping, preserving CJK text spacing, managing quotation marks, and controlling UTF-8 decoding.

```APIDOC
ftfy.fix_text(text, config=None, **kwargs)
  Fixes text using a sequence of fixes.

ftfy.fix_and_explain(text, config=None, **kwargs)
  Fixes text and explains the changes made.

Configuration Options:
  unescape_html (bool): If True, unescapes HTML entities. Set to False to preserve HTML.
  fix_character_width (bool): If True, fixes character width issues, especially for CJK text. Set to False to preserve spacing.
  uncurl_quotes (bool): If True, replaces typographically correct quotes with standard ones. Set to False to preserve them or use smartypants.
  decode_inconsistent_utf8 (bool): If True, attempts to fix decoding errors in UTF-8. Set to False for cautious fixing.

TextFixerConfig:
  An object that holds the configuration for ftfy. Can be passed directly to fix_text and fix_and_explain.
  Keyword arguments can be passed to override default configuration values.
```

--------------------------------

### Recognizing and Fixing Mojibake with ftfy

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Demonstrates the `ftfy.fix_and_explain` function on several kinds of mojibake. The helper takes a potentially corrupted string, fixes it, and prints the list of steps ftfy applied.

```python
from ftfy import fix_and_explain
from pprint import pprint

def show_explanation(text):
    print(f"Original: {text}")
    fixed, expl = fix_and_explain(text)
    print(f"   Fixed: {fixed}\n")
    pprint(expl)

EXAMPLES = [
    "Merci de t‚Äö√†√∂¬¨¬©l‚Äö√†√∂¬¨¬©charger le plug-in",
    "The Mona Lisa doesn’t have eyebrows.",
    "I just figured out how to tweet emojis! â\x9a½í\xa0½í¸\x80í\xa0½í¸\x81í\xa0½í¸"
    "\x82í\xa0½í¸\x86í\xa0½í¸\x8eí\xa0½í¸\x8eí\xa0½í¸\x8eí\xa0½í¸\x8e"
]

show_explanation(EXAMPLES[0])
show_explanation(EXAMPLES[1])
show_explanation(EXAMPLES[2])
```

--------------------------------

### Show ASCII Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Displays the ASCII character table using the utility function.

```python
show_char_table("ascii")
```

--------------------------------

### Show MacRoman Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Displays the MacRoman character table.

```python
show_char_table('macroman')
```

--------------------------------

### ftfy.ExplainedText Class

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

Describes the structure of the NamedTuple returned by the `fix_and_explain` and `fix_encoding_and_explain` functions. It contains the fixed text and a list of applied transformations.
```python
from ftfy import fix_and_explain

text = "Some text"
result = fix_and_explain(text)

# 'result' is an ExplainedText named tuple with the fields 'text' and 'explanation'
print(type(result).__name__)
# Output: ExplainedText
print(result.text)
# Output: Some text
print(result.explanation)
# Output: []
```

--------------------------------

### ftfy Citation for Research

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Provides the recommended citation format for the ftfy library in research contexts, including version and DOI.

```APIDOC
Robyn Speer. (2019). ftfy (Version 5.5). Zenodo. http://doi.org/10.5281/zenodo.2591652
```

--------------------------------

### ftfy.fix_and_explain Function

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

Similar to `fix_text`, but also returns a detailed explanation of the transformations performed. It processes the text as a whole rather than line by line, so that a single explanation covers the entire input.

```python
from ftfy import fix_and_explain

text_to_explain = "Another example with &amp;#39;"
fixed, explanation = fix_and_explain(text_to_explain)
print(fixed)
# Example Output: Another example with '
print(explanation)
# Example Output: [('apply', 'unescape_html'), ('apply', 'unescape_html')]
```

--------------------------------

### ftfy Test Case Structure

Source: https://github.com/rspeer/python-ftfy/blob/main/tests/test-cases/README.md

Defines the structure of a test case JSON file for the ftfy library. It includes fields for labeling, commenting, original text, and expected fixed text.
```json { "label": "A description of the test case.", "comment": "Further details on the test case.", "original": "The text to run through ftfy.", "fixed-encoding": "(optional) The expected result of ftfy.fix_encoding(original)", "fixed": "The expected result of ftfy.fix_text(original)", "expect": "pass | fail" } ``` -------------------------------- ### Show Windows-1251 Character Table Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb Displays the Windows-1251 character table. ```python show_char_table("windows-1251") ``` -------------------------------- ### Displaying Mojibake from DOS NFO Files Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb This snippet shows how to interpret and display text that was originally encoded in CP437 (common in DOS) and then potentially misinterpreted as Windows-1252. It highlights the 'vintage mojibake' often seen in .NFO files. ```python crack_nfo = r""" ───── ────────────── ───────────────── ────────────── ─────────────── ──── ▄█████▄ ▀█████▄ ████▄ ▀█████████ ████▄ ▄█████▄▀████▄ ▀██████▄ ▀████▄ ▄██▄ ████████ █████▀ ██████ ▀████████ ██████ ▀███████▌██████ ███████ ████▄█████ ███ ▀███▌█▀ ▄▄▄█▌▀█████ ▌████ ▄▄▄█▀▐████ ███▀▌▀█▌█▌ ███▌ ██▀▐████ ██████████ ███ ▐██▌▌ ████ ▌████ ████ ████ ████ ██▌ ▐▌█▌ ████ ██ ████ ██▌▐█▌▐███ ███ ▄███▌▄▄ ████ ████ ████ ████ ▄████ ██▌ █▄▄█▌ ████ ██ ▄████ ██ █ ███ ████████ ██ ████ ████ ████ ████ █████ ██▌ ▀▀██▌▐███▀ ██▐█████ ██ ▄ ███ ██████▀ ▄▀▀ ████ ████ ████ ████ ▀████ █████▄▐██▄▀▀ ▄███ ▀████ ██ ▄ ███ ███▀ ▄▄██ ████ ▐████ ████ ████ ████ ██▌▐▐██▌█▀██▄ ████ ████ ██ ███ ███ █████▄ ▀▀█ ████▀ ▐████ ░███ ▐████ ███▄▐██▌█▌▐██▌▐███ ▐███░ ██▌ r███ ░██ █████████▄ ▐██▌▄██████ ▒░██ █████ ███████▌█▌▐███ ███ ███░▒ ███ o██░ ▒░█ ▀███████▀ ▀ ███▐████▀ ▓▒░█ ▐███▌ ▀██████▐█▌▐███ ███ ▐█░▒▓ ██▌ y█░▒ - ▌─────▐▀─ ▄▄▄█ ── ▀▀ ───── ────── ▀▀▀ ─ ▐▀▀▀▀ ▀▀ ████ ──── ▀█▀ ─ ▀ ────▐ ─ ╓────────────────────────[ RELEASE INFORMATION ]───────────────────────╖ 
╓────────────────────────────────────────────────────────────────────────────╖
║ -/- THE EVEN MORE INCREDIBLE MACHINE FOR *DOS* FROM SIERRA/DYNAMIX -/- ║
╙────────────────────────────────────────────────────────────────────────────╜
"""
print(crack_nfo.encode('cp437').decode('windows-1252'))
```

--------------------------------

### Badness Heuristic Functions - ftfy.badness

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/heuristic.rst

Provides functions for calculating the 'badness' of text, a heuristic for detecting mojibake. It includes the main `badness` function and `is_bad` for checking if text is considered bad.

```python
import ftfy.badness

# Example usage:
text = "This is some text."
score = ftfy.badness.badness(text)
print(f"Badness score: {score}")

is_text_bad = ftfy.badness.is_bad(text)
print(f"Is text bad? {is_text_bad}")
```

--------------------------------

### Show Windows-1252 Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Displays the Windows-1252 character table.

```python
show_char_table('windows-1252')
```

--------------------------------

### Fix Multiple Layers of Mojibake

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Shows how ftfy can resolve text that has undergone several layers of encoding corruption, a common issue with legacy systems or complex data pipelines.

```python
import ftfy

ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
```

--------------------------------

### Show Latin-1 Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Displays the Latin-1 character table, highlighting the 'here be dragons' section.
```python
show_char_table('latin-1')
```

--------------------------------

### Display Character Table

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

A utility function to display character tables for different encodings, highlighting printable and control characters. It uses the blessings library for terminal formatting and `ftfy.formatting` for centering text.

```python
import blessings
import ftfy.formatting

term = blessings.Terminal()  # enable colorful text

def displayable_codepoint(codepoint, encoding):
    char = bytes([codepoint]).decode(encoding, 'replace')
    if char == '\ufffd':
        # the byte has no assigned character in this encoding
        return '▓▓'
    elif not char.isprintable():
        return '░░'
    else:
        return char

def show_char_table(encoding):
    print(f"encoding: {encoding}\n   0 1 2 3 4 5 6 7 8 9 a b c d e f\n")
    for row in range(16):
        print(f"{row*16:>02x}", end=" ")
        if row == 0:
            print(ftfy.formatting.display_center(term.green(" control characters "), 32, "░"))
        elif row == 8 and encoding == 'latin-1':
            print(ftfy.formatting.display_center(term.green(" here be dragons "), 32, "░"))
        else:
            for col in range(16):
                char = displayable_codepoint(row * 16 + col, encoding)
                print(f"{char:<2}", end="")
            print()
```

--------------------------------

### UTF-8 Character Encoding Visualization

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

This function shows how characters are represented in UTF-8 by encoding each one and displaying the resulting byte sequence. It helps visualize the multi-byte nature of UTF-8 for non-ASCII characters.
```python
# Code to look at the encoding of each character in UTF-8
def show_utf8(text):
    for char in text:
        char_bytes = char.encode('utf-8')
        byte_sequence = ' '.join([f"{byte:>02x}" for byte in char_bytes])
        print(f"{char} = {byte_sequence}")

text = "l’Hôpital"
show_utf8(text)
```

--------------------------------

### Fix Mojibake Text

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Demonstrates the primary function of the ftfy library by fixing a string containing mojibake.

```python
import ftfy

ftfy.fix_text("merci de t‚Äö√†√∂¬¨¬©l‚Äö√†√∂¬¨¬©charger le plug-in")
```

--------------------------------

### Decoding Bytes to Text as UTF-8 in Python

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/avoid.rst

Shows how to decode a byte buffer into a text string using UTF-8 encoding in Python. This is used when converting raw bytes to readable text.

```python
text = bytebuffer.decode('utf-8', 'replace')
```

--------------------------------

### Fix Mojibake with Non-Breaking Spaces

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Demonstrates ftfy's handling of mojibake in which a non-breaking space (U+00A0) is part of the garbled sequence, including the case where it was converted to an ASCII space.

```python
import ftfy

ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
ftfy.fix_text('Ã perturber la rÃ©flexion')
```

--------------------------------

### ftfy Restrictions for AI Training

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Specifies restrictions on creating derived works from ftfy, particularly those involving AI training datasets or obscuring authorship.

```APIDOC
You _may not_ make a derived work of ftfy that obscures its authorship, such as
by putting its code in an AI training dataset, including the code in AI
training at runtime, or using a generative AI that copies code from such a
dataset.
```

--------------------------------

### Fix Mojibake (Encoding Mix-ups)

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Demonstrates ftfy's ability to correct text where characters were misinterpreted due to incorrect encoding. It handles common UTF-8 decoding errors.

```python
import ftfy

ftfy.fix_text('âœ” No problems')
```

--------------------------------

### ftfy BibTeX Citation

Source: https://github.com/rspeer/python-ftfy/blob/main/README.md

Presents the citation for the ftfy library in BibTeX format, suitable for LaTeX documents.

```BibTeX
@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}
```

--------------------------------

### ftfy.explain_unicode Function

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

A utility function that breaks a string down into its constituent Unicode characters and prints details about each one. Useful for understanding the composition of strings.

```python
from ftfy import explain_unicode

unicode_string = "Héllo"

# explain_unicode prints directly and returns None. It writes one line per
# character, giving its codepoint (such as U+00E9), its Unicode category
# (such as [Ll]), and its name (such as LATIN SMALL LETTER E WITH ACUTE).
explain_unicode(unicode_string)
```

--------------------------------

### ftfy.fix_encoding and ftfy.fix_encoding_and_explain Functions

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

These functions specifically address encoding and decoding problems, excluding other issues like HTML entities. `fix_encoding` returns only the fixed string, while `fix_encoding_and_explain` also returns an explanation.
```python
from ftfy import fix_encoding, fix_encoding_and_explain

encoding_issue_text = "\xc3\xa9cole"

fixed_string = fix_encoding(encoding_issue_text)
print(fixed_string)
# Output: école

fixed_str, explanation = fix_encoding_and_explain(encoding_issue_text)
print(fixed_str)
# Output: école
print(explanation)
# Output: [('encode', 'latin-1'), ('decode', 'utf-8')]
```

--------------------------------

### Fix Mojibake with Curly Quotes

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Illustrates ftfy's capability to fix mojibake that appears alongside 'curly quotes', which requires a two-step correction process.

```python
import ftfy

ftfy.fix_text("l’humanitÃ©")
```

--------------------------------

### Lossy UTF-8 Heuristic - ftfy.chardata

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/heuristic.rst

Describes the `LOSSY_UTF8_RE` regular expression in `chardata.py`. This heuristic targets sequences that appear to be incorrectly decoded UTF-8 in which some characters have been replaced by question marks or by the Unicode replacement character (U+FFFD). It is used by `ftfy.fixes.replace_lossy_sequences`.

```python
import ftfy.fixes

# Assuming LOSSY_UTF8_RE is accessible or its logic is demonstrated.
# This is a conceptual example, as the regex itself is not provided in the text.
# In practice, ftfy applies ftfy.fixes.replace_lossy_sequences for you while fixing encodings.

# Example of how it might be used (conceptual; the function operates on bytes):
# byts = b"..."
# cleaned = ftfy.fixes.replace_lossy_sequences(byts)

print("The lossy UTF-8 heuristic uses the regex LOSSY_UTF8_RE in chardata.py.")
print("It replaces sequences containing '?' or U+FFFD with the replacement character.")
```

--------------------------------

### UTF-8 Detector Heuristic - ftfy.chardata

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/heuristic.rst

Details the `UTF8_DETECTOR_RE` regular expression in `chardata.py`.
This heuristic identifies specific sequences of mojibake characters resulting from common UTF-8 decoding errors. It is used in `ftfy.fixes.decode_inconsistent_utf8`.

```python
import ftfy.fixes

# Assuming UTF8_DETECTOR_RE is accessible or its logic is demonstrated.
# This is a conceptual example, as the regex itself is not provided in the text.
# In practice, you would use ftfy.fixes.decode_inconsistent_utf8 directly.

# Example of how it might be used (conceptual):
# mojibake_text = "..."
# fixed_text = ftfy.fixes.decode_inconsistent_utf8(mojibake_text)

print("The UTF-8 detector heuristic uses the regex UTF8_DETECTOR_RE in chardata.py.")
print("It helps fix text that has specific UTF-8 decoding errors.")
```

--------------------------------

### UTF-8 to Windows-1252 Conversion

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Shows the result of encoding a string with non-ASCII characters as UTF-8 and then decoding the bytes as Windows-1252. The mismatch produces mojibake.

```python
text = "l’Hôpital"
print(text.encode('utf-8').decode('windows-1252'))
```

--------------------------------

### ftfy.fix_encoding

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/config.rst

A specialized function to detect and repair decoding errors (mojibake) in text. While it can be used independently, mojibake might be entangled with other text issues, and limiting the process to this step could make some mojibake unfixable.

```python
import ftfy

# "école" whose UTF-8 bytes were decoded as Latin-1
text_with_mojibake = "\xc3\xa9cole"
fixed_text = ftfy.fix_encoding(text_with_mojibake)
```

--------------------------------

### Decode HTML Entities Outside HTML

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/index.rst

Shows ftfy's ability to decode HTML entities that appear in plain text, even when they are not properly formed or capitalized according to the standard.
```python
import ftfy

# by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
ftfy.fix_text('P&EACUTE;REZ')
```

--------------------------------

### ftfy.fixes Module Functions

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/fixes.rst

This section details the text-fixing functions available in the `ftfy.fixes` module. These functions address specific text normalization tasks, such as decoding escaped characters, fixing inconsistent UTF-8 encoding, handling control characters, and uncurling quotes.

```APIDOC
ftfy.fixes:
  decode_escapes(text: str) -> str
    Decodes backslash escape sequences within a string.
  decode_inconsistent_utf8(text: str) -> str
    Fixes stretches of UTF-8 mojibake embedded in otherwise correctly decoded text.
  fix_c1_controls(text: str) -> str
    Replaces C1 control characters with their Windows-1252 equivalents.
  fix_character_width(text: str) -> str
    Normalizes character widths, particularly for East Asian characters.
  fix_latin_ligatures(text: str) -> str
    Replaces common Latin ligatures with their constituent characters.
  fix_line_breaks(text: str) -> str
    Normalizes different types of line break characters.
  fix_surrogates(text: str) -> str
    Replaces paired surrogate codepoints with the character they represent, and unpaired ones with U+FFFD.
  remove_control_chars(text: str) -> str
    Removes common control characters from a string.
  remove_terminal_escapes(text: str) -> str
    Removes ANSI escape codes often found in terminal output.
  replace_lossy_sequences(byts: bytes) -> bytes
    Replaces byte sequences that look like UTF-8 with lost information with the encoding of U+FFFD.
  restore_byte_a0(byts: bytes) -> bytes
    Restores byte 0xA0 (the non-breaking space, U+00A0) where it was replaced by an ASCII space.
  uncurl_quotes(text: str) -> str
    Replaces curly quotation marks with standard straight quotes.
  unescape_html(text: str) -> str
    Unescapes HTML entities within a string.
```

--------------------------------

### Fix Mascot Text

Source: https://github.com/rspeer/python-ftfy/blob/main/notebook/ftfy talk.ipynb

Fixes a mojibake string that represents the mascot emoticon.
```python
import ftfy

ftfy.fix_text("(Ã\xa0¸‡'̀⌣'ÃŒÂ\x81)Ã\xa0¸‡")
```

--------------------------------

### Guess Bytes Encoding

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/detect.rst

This function attempts to guess the encoding of a byte sequence. It relies on strong signals like UTF-16 byte-order marks or successful UTF-8 decoding. It cannot guess non-Unicode CJK encodings such as Shift-JIS or Big5.

```python
def guess_bytes(bstring):
    """Guess the encoding of a byte sequence.

    This function attempts to be less terrible than other byte-encoding-guessers
    in common cases. Instead of using probabilistic heuristics, it picks up on
    very strong signals like "having a UTF-16 byte-order mark" or "decoding
    successfully as UTF-8".

    This function won't solve everything. It can't solve everything. In
    particular, it has no capacity to guess non-Unicode CJK encodings such as
    Shift-JIS or Big5.
    """
    pass
```

--------------------------------

### ftfy.fix_text Function

Source: https://github.com/rspeer/python-ftfy/blob/main/docs/explain.rst

The primary function for fixing text encoding issues. It processes the input string, applies the enabled fixes, and returns the cleaned text. It operates on lines of text independently.

```python
from ftfy import fix_text

text_with_errors = "This text has some encoding issues like \xe2\x80\x93 a dash."
fixed_text = fix_text(text_with_errors)
print(fixed_text)
# Example Output: This text has some encoding issues like – a dash.
```
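
--------------------------------

### Sketch: Replaying Encode and Decode Steps by Hand

The explanation lists shown in this document, such as `[('encode', 'latin-1'), ('decode', 'utf-8')]`, describe an encode step followed by a decode step. The sketch below replays such a list using only Python's built-in codecs, to make visible what those two steps do to a string. `apply_codec_steps` is a hypothetical helper written for this illustration. It is not part of ftfy, it handles only `'encode'` and `'decode'` steps, and it makes no attempt to detect mojibake; ftfy's own `apply_plan` remains the documented way to reuse an explanation.

```python
def apply_codec_steps(text, steps):
    """Apply ('encode', codec) and ('decode', codec) steps in order.

    Hypothetical helper for illustration only; not ftfy's implementation.
    """
    value = text
    for action, codec in steps:
        if action == "encode":
            value = value.encode(codec)   # str -> bytes
        elif action == "decode":
            value = value.decode(codec)   # bytes -> str
        else:
            raise ValueError(f"unsupported step: {action!r}")
    return value


# Create one layer of mojibake: UTF-8 bytes misread as Latin-1
original = "école"
mojibake = original.encode("utf-8").decode("latin-1")

# Reverse it with the same kind of step list that ftfy reports
plan = [("encode", "latin-1"), ("decode", "utf-8")]
repaired = apply_codec_steps(mojibake, plan)

print(repr(mojibake))
print(repaired)
```

The reversal works here because Latin-1 maps every byte to exactly one character, so encoding the garbled string recovers the original UTF-8 bytes unchanged. Encodings with unassigned bytes, such as Windows-1252, do not always round-trip this cleanly.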