Repository: https://github.com/tatuylonen/wiktextract
# Wiktextract

Wiktextract is a Python package and utility for extracting structured, machine-readable data from Wiktionary dump files. It parses XML dump files from various Wiktionary language editions (most notably English Wiktionary) and outputs comprehensive lexical data in JSONL format, including word definitions, parts of speech, translations, pronunciations (with IPA and audio links), etymologies, inflection tables, synonyms, antonyms, and other linguistic relationships. The tool stands out by expanding Wiktionary's templates and Lua macros during extraction, enabling superior accuracy and completeness compared to simple text-parsing approaches.

The package supports extraction from multiple Wiktionary editions, including English, German, French, Spanish, Chinese, Japanese, Korean, and many others. It provides both a command-line interface (`wiktwords`) for straightforward batch processing and a Python API for programmatic integration. The extracted data is useful for natural language processing, machine translation, semantic parsing, language generation applications, and building multilingual dictionaries with declension/conjugation information.

## WiktionaryConfig

The `WiktionaryConfig` class controls what data to capture during extraction and collects statistics. It specifies which languages to process, which data types to extract (translations, pronunciations, etymologies, etc.), and maintains error/warning logs during processing.

```python
from wiktextract import WiktionaryConfig

# Create a configuration that extracts all data types for English,
# Translingual, and German entries
config = WiktionaryConfig(
    dump_file_lang_code="en",                    # Wiktionary edition language code
    capture_language_codes=["en", "mul", "de"],  # Languages to extract (None = all)
    capture_translations=True,                   # Extract translation data
    capture_pronunciation=True,                  # Extract IPA, audio files, etc.
    capture_linkages=True,                       # Extract synonyms, antonyms, etc.
    capture_compounds=True,                      # Extract compound words
    capture_redirects=True,                      # Extract redirect pages
    capture_examples=True,                       # Extract usage examples
    capture_etymologies=True,                    # Extract etymology information
    capture_inflections=True,                    # Extract inflection tables
    capture_descendants=True,                    # Extract descendant words
    verbose=False,                               # Enable verbose logging
    expand_tables=False,                         # Expand inflection tables to text
)

# Access statistics after processing
print(f"Pages processed: {config.num_pages}")
print(f"Languages found: {dict(config.language_counts)}")
print(f"Parts of speech: {dict(config.pos_counts)}")
print(f"Errors: {len(config.errors)}")
print(f"Warnings: {len(config.warnings)}")
```

## WiktextractContext

The `WiktextractContext` class is the main processing context; it combines the wikitextprocessor `Wtp` context with a `WiktionaryConfig`. It manages database connections and thesaurus data, and provides the environment for page parsing operations.

```python
from wiktextract import WiktextractContext, WiktionaryConfig
from wikitextprocessor import Wtp

# Create the wikitextprocessor context
wtp = Wtp(
    db_path="wiktionary.db",  # SQLite database path for caching
    lang_code="en",           # Wiktionary edition language
    quiet=False,              # Show progress messages
)

# Create wiktextract configuration
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=None,  # Capture all languages
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
)

# Create the extraction context
wxr = WiktextractContext(wtp, config)

# Context provides access to current processing state
wxr.lang = "English"  # Current language being processed
wxr.word = "example"  # Current word being processed
wxr.pos = "noun"      # Current part of speech

# Reconnect databases (useful in multiprocessing scenarios)
wxr.reconnect_databases(check_same_thread=True)

# Clean up before passing to worker processes
wxr.remove_unpicklable_objects()
```

## parse_wiktionary

The `parse_wiktionary` function is the primary entry point for processing entire Wiktionary dump files. It handles the two-phase extraction process: first parsing templates and modules, then extracting word data in parallel.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, parse_wiktionary
from wikitextprocessor import Wtp

# Set up the extraction context
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
)
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
wxr = WiktextractContext(wtp, config)

# Define namespace IDs to process (Main=0, Template=10, Module=828)
namespace_ids = {
    wxr.wtp.NAMESPACE_DATA.get(name, {}).get("id", 0)
    for name in ["Main", "Template", "Module"]
}

# Process the dump file and write to output
with open("output.jsonl", "w", encoding="utf-8") as out_f:
    parse_wiktionary(
        wxr=wxr,
        dump_path="enwiktionary-20240101-pages-articles.xml.bz2",
        num_processes=8,              # Parallel processes (None = auto)
        phase1_only=False,            # False = full extraction
        namespace_ids=namespace_ids,  # Namespaces to process
        out_f=out_f,                  # Output file object
        human_readable=False,         # True = pretty-printed JSON
        override_folders=None,        # Override page folders
        skip_extract_dump=False,      # Skip if DB exists
        save_pages_path=None,         # Save extracted pages
    )

# Output: JSONL file with one JSON object per line
# {"word": "example", "lang": "English", "lang_code": "en", "pos": "noun", ...}
```

## reprocess_wiktionary

The `reprocess_wiktionary` function processes pages from an existing SQLite database without re-parsing the dump file. This is useful for re-extracting data with different settings or for debugging specific pages.
```python
from wiktextract import WiktextractContext, WiktionaryConfig, reprocess_wiktionary
from wikitextprocessor import Wtp

# Load from existing database
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_translations=True,
    capture_pronunciation=True,
)
wxr = WiktextractContext(wtp, config)

with open("reprocessed.jsonl", "w", encoding="utf-8") as out_f:
    reprocess_wiktionary(
        wxr=wxr,
        num_processes=4,
        out_f=out_f,
        human_readable=False,
        search_pattern=None,  # Optional: filter pages by pattern
    )

# Using search_pattern to filter pages
# search_pattern uses SQL LIKE syntax: % = any chars, _ = single char
with open("english_only.jsonl", "w", encoding="utf-8") as out_f:
    reprocess_wiktionary(
        wxr=wxr,
        num_processes=4,
        out_f=out_f,
        human_readable=True,
        search_pattern="%==English==%",  # Only pages with English section
    )
```

## parse_page

The `parse_page` function extracts data from a single Wiktionary page given its title and wikitext content. This is useful for testing, debugging, or processing individual pages.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, parse_page
from wikitextprocessor import Wtp

# Set up context
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
)
wxr = WiktextractContext(wtp, config)

# Example Wiktionary page content
page_text = """
==English==

===Etymology===
From {{inh|en|enm|word}}, from {{inh|en|ang|word}}.

===Pronunciation===
* {{IPA|en|/wɜːd/}}
* {{audio|en|en-us-word.ogg|Audio (US)}}

===Noun===
{{en-noun}}

# A unit of language.
#: {{ux|en|Be careful with your '''words'''.}}
# A promise.
#: {{syn|en|promise|vow}}

===Verb===
{{en-verb}}

# To say or write using words.

====Synonyms====
* {{l|en|express}}
* {{l|en|phrase}}
"""

# Parse the page
results = parse_page(wxr, "word", page_text)

# Results is a list of dictionaries, one per word/POS combination
for entry in results:
    print(f"Word: {entry.get('word')}")
    print(f"Language: {entry.get('lang')}")
    print(f"POS: {entry.get('pos')}")
    print(f"Senses: {len(entry.get('senses', []))}")
    for sense in entry.get('senses', []):
        print(f"  - {sense.get('glosses', [''])[0]}")
    print(f"Sounds: {entry.get('sounds', [])}")
    print(f"Etymology: {entry.get('etymology_text', '')[:100]}...")
    print("---")

# Example output structure:
# {
#   "word": "word",
#   "lang": "English",
#   "lang_code": "en",
#   "pos": "noun",
#   "etymology_text": "From Middle English word, from Old English word.",
#   "sounds": [{"ipa": "/wɜːd/"}, {"audio": "en-us-word.ogg", ...}],
#   "senses": [
#     {"glosses": ["A unit of language."], "examples": [...]},
#     {"glosses": ["A promise."], "synonyms": [...]}
#   ],
#   "forms": [{"form": "words", "tags": ["plural"]}]
# }
```

## extract_thesaurus_data

The `extract_thesaurus_data` function processes Wiktionary's Thesaurus namespace pages to extract synonym, antonym, and other word-relationship data. The extracted data is stored in a SQLite database and can be merged into main word entries.
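The merge into main word entries mentioned above is easy to sketch in plain Python, since entries and thesaurus rows are ordinary dicts. In this sketch the entry and rows are synthetic stand-ins, and `merge_synonyms` is an illustrative helper under assumed field names, not a wiktextract API:

```python
def merge_synonyms(entry, thesaurus_rows):
    """Merge thesaurus-derived synonym terms into a word entry dict,
    skipping terms already present. Returns the modified entry."""
    existing = {s["word"] for s in entry.setdefault("synonyms", [])}
    for row in thesaurus_rows:
        if row["linkage"] == "synonyms" and row["term"] not in existing:
            # "source" marks the term as coming from a Thesaurus page
            entry["synonyms"].append({"word": row["term"], "source": "Thesaurus"})
            existing.add(row["term"])
    return entry

entry = {"word": "happy", "pos": "adj", "synonyms": [{"word": "glad"}]}
rows = [
    {"linkage": "synonyms", "term": "content"},
    {"linkage": "synonyms", "term": "glad"},  # duplicate, skipped
    {"linkage": "antonyms", "term": "sad"},   # wrong linkage type, skipped
]
merged = merge_synonyms(entry, rows)
print([s["word"] for s in merged["synonyms"]])  # ['glad', 'content']
```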
```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_thesaurus_data
from wikitextprocessor import Wtp

# Enable thesaurus extraction in config
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_linkages=True,
)
config.extract_thesaurus_pages = True  # Enable thesaurus processing

wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract thesaurus data (requires dump to be already processed)
extract_thesaurus_data(wxr, num_processes=4)

# Query thesaurus data programmatically
from wiktextract.thesaurus import search_thesaurus

# Find synonyms for "happy" as an adjective
for term in search_thesaurus(
    wxr.thesaurus_db_conn,
    entry="happy",
    lang_code="en",
    pos="adj",
    linkage_type="synonyms",  # Optional: filter by linkage type
):
    print(f"Term: {term.term}")
    print(f"Linkage: {term.linkage}")
    print(f"Tags: {term.tags}")
    print(f"Sense: {term.sense}")

# ThesaurusTerm dataclass fields:
# - entry: str           # Main word entry
# - language_code: str   # Language code
# - pos: str             # Part of speech
# - linkage: str         # Relationship type (synonyms, antonyms, etc.)
# - term: str            # Related term
# - tags: list[str]      # Qualifier tags
# - raw_tags: list[str]  # Unparsed tags
# - topics: list[str]    # Topic categories
# - roman: str           # Romanization
# - sense: str           # Sense description
```

## extract_categories

The `extract_categories` function extracts the Wiktionary category tree hierarchy by evaluating Lua modules. This provides structured access to the category relationships defined in Wiktionary.
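Because the returned tree is a plain nested dict, it can be traversed with ordinary recursion. A sketch that enumerates root-to-leaf paths over a tiny synthetic tree in the documented `roots`/`nodes`/`children` shape; real category data can contain cycles, so production code should also track visited nodes:

```python
def iter_category_paths(tree, name, path=()):
    """Yield root-to-leaf name paths through an
    extract_categories-style tree."""
    path = path + (name,)
    children = tree["nodes"].get(name, {}).get("children", [])
    if not children:
        yield path
        return
    for child in children:
        yield from iter_category_paths(tree, child, path)

# Tiny synthetic tree following the documented shape
tree = {
    "roots": ["Fundamental"],
    "nodes": {
        "Fundamental": {"children": ["Emotions"]},
        "Emotions": {"children": ["Joy", "Fear"]},
        "Joy": {},
        "Fear": {},
    },
}
paths = [p for root in tree["roots"] for p in iter_category_paths(tree, root)]
for p in paths:
    print(" > ".join(p))
# Fundamental > Emotions > Joy
# Fundamental > Emotions > Fear
```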
```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_categories
from wikitextprocessor import Wtp
import json

# Set up context (requires processed dump with Lua modules)
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(dump_file_lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract category tree
category_tree = extract_categories(wxr)

# Structure of returned data:
# {
#   "roots": ["Fundamental", ...],  # Top-level categories
#   "nodes": {
#     "category name": {
#       "name": "Category Name",
#       "desc": "Description with {{templates}}",
#       "clean_desc": "Cleaned description text",
#       "children": ["child1", "child2"],
#       "sort": ["sort key 1", "sort key 2"]
#     }
#   }
# }

# Save to file
with open("categories.json", "w", encoding="utf-8") as f:
    json.dump(category_tree, f, indent=2, sort_keys=True)

# Access category information
print(f"Root categories: {category_tree['roots']}")
print(f"Total categories: {len(category_tree['nodes'])}")

# Example: find all children of "Emotions" category
emotions = category_tree['nodes'].get('emotions', {})
print(f"Emotions subcategories: {emotions.get('children', [])}")
```

## extract_namespace

The `extract_namespace` function exports all pages from a specific namespace (like Template or Module) to a tar archive. This is useful for backing up or analyzing Wiktionary's template/module infrastructure.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_namespace
from wikitextprocessor import Wtp

# Set up context
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(dump_file_lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract all templates to a tar file
extract_namespace(wxr, "Template", "templates.tar")

# Extract all Lua modules to a tar file
extract_namespace(wxr, "Module", "modules.tar")

# The tar file structure:
# templates.tar/
#   Template/
#     en-noun.txt
#     IPA.txt
#     ...
#
# Each file contains the wikitext source of the template/module
```

## Command-Line Interface (wiktwords)

The `wiktwords` command provides a comprehensive CLI for extracting data from Wiktionary dumps without writing Python code.

```bash
# Basic extraction for all languages with all data types
wiktwords --all --all-languages --out data.jsonl \
    --edition en enwiktionary-20240101-pages-articles.xml.bz2

# Extract only English and German with specific data types
wiktwords --language-code en --language-code de \
    --translations --pronunciations --linkages \
    --out english_german.jsonl --edition en \
    enwiktionary-20240101-pages-articles.xml.bz2

# Create database for faster subsequent processing
wiktwords --db-path enwikt.db --edition en \
    enwiktionary-20240101-pages-articles.xml.bz2

# Process single page for debugging (with existing database)
wiktwords --db-path enwikt.db --edition en --all --all-languages \
    --out word_debug.json --page "word" --human-readable

# Process page from file
# File format: first line "TITLE: page_title", rest is wikitext
wiktwords --db-path enwikt.db --edition en --all \
    --out test.json --page test_page.txt --human-readable

# Extract with parallel processing control
wiktwords --all --all-languages --num-processes 16 \
    --out data.jsonl --edition en dump.xml.bz2

# Extract category tree, templates, and modules
wiktwords --db-path enwikt.db --edition en \
    --categories-file categories.json \
    --templates-file templates.tar \
    --modules-file modules.tar \
    --skip-extraction dump.xml.bz2

# Filter pages by pattern (with existing database)
wiktwords --db-path enwikt.db --edition en --all \
    --search-pattern "%==English==%==Noun==%" \
    --out english_nouns.jsonl

# Run extraction for non-English Wiktionary
wiktwords --all --all-languages --out fr_data.jsonl \
    --edition fr frwiktionary-20240101-pages-articles.xml.bz2

# Using container (Podman/Docker)
podman run -v /data:/data -it --rm ghcr.io/tatuylonen/wiktextract \
    --all --all-languages --out /data/output.jsonl \
    --edition en /data/enwiktionary-20240101-pages-articles.xml.bz2
```

## Reading Extracted JSONL Data

The extracted data is in JSONL format (one JSON object per line). Here are examples of reading and processing the data.

```python
import json
import re
from collections import defaultdict

# Read JSONL file line by line (memory efficient)
def read_wiktextract_data(filename):
    with open(filename, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count entries by language
language_counts = defaultdict(int)
for entry in read_wiktextract_data("data.jsonl"):
    lang = entry.get("lang", "Unknown")
    language_counts[lang] += 1
print(dict(language_counts))

# Example: extract all English nouns with their definitions
english_nouns = []
for entry in read_wiktextract_data("data.jsonl"):
    if entry.get("lang_code") == "en" and entry.get("pos") == "noun":
        word_data = {
            "word": entry["word"],
            "definitions": [
                sense.get("glosses", [""])[0]
                for sense in entry.get("senses", [])
            ],
            "forms": entry.get("forms", []),
        }
        english_nouns.append(word_data)

# Example: find words with specific pronunciation
def find_by_ipa(data_file, ipa_pattern):
    pattern = re.compile(ipa_pattern)
    for entry in read_wiktextract_data(data_file):
        for sound in entry.get("sounds", []):
            if "ipa" in sound and pattern.search(sound["ipa"]):
                yield entry
                break

# Find words with /θ/ sound
for entry in find_by_ipa("data.jsonl", r"/.*θ.*"):
    print(f"{entry['word']}: {entry.get('sounds', [])}")

# Example: build translation dictionary
translations = defaultdict(list)
for entry in read_wiktextract_data("data.jsonl"):
    if entry.get("lang_code") != "en":
        continue
    word = entry["word"]
    for trans in entry.get("translations", []):
        if trans.get("lang_code") == "de":  # German translations
            translations[word].append({
                "german": trans.get("word"),
                "sense": trans.get("sense", ""),
            })

# Pretty-print a single entry for inspection
def pretty_print_entry(entry):
    print(json.dumps(entry, indent=2, sort_keys=True, ensure_ascii=False))

# Example entry structure:
example_entry = {
    "word": "thrill",
    "lang": "English",
    "lang_code": "en",
    "pos": "verb",
    "etymology_text": "From Middle English thrillen...",
    "etymology_templates": [
        {"name": "inh", "args": {"1": "en", "2": "enm", "3": "thrillen"},
         "expansion": "Middle English thrillen"}
    ],
    "sounds": [
        {"ipa": "/θɹɪl/"},
        {"audio": "en-us-thrill.ogg", "mp3_url": "...", "ogg_url": "..."}
    ],
    "forms": [
        {"form": "thrills", "tags": ["present", "singular", "third-person"]},
        {"form": "thrilling", "tags": ["present", "participle"]},
        {"form": "thrilled", "tags": ["past", "participle"]}
    ],
    "senses": [
        {
            "glosses": ["To suddenly excite someone..."],
            "tags": ["ergative", "figuratively"],
            "examples": [{"text": "The movie thrilled audiences."}],
            "synonyms": [{"word": "excite"}, {"word": "electrify"}]
        }
    ],
    "translations": [
        {"lang": "German", "lang_code": "de", "word": "begeistern",
         "sense": "to excite"}
    ]
}
```

## Data Structure Reference

The extracted JSON entries follow a consistent structure with these key fields.
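Since optional fields vary by edition and extraction settings, a light sanity check against the required fields can catch malformed records early. `check_entry` below is an illustrative helper, not part of wiktextract, and its rules are assumptions drawn from the field reference:

```python
REQUIRED_FIELDS = ("word", "lang", "lang_code", "pos")

def check_entry(entry):
    """Return a list of problems for a word entry; empty list means OK.
    Redirect entries (which carry a 'redirect' key) are exempt."""
    if "redirect" in entry:
        return []
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if not entry.get("senses"):
        problems.append("no senses")
    return problems

ok = {"word": "example", "lang": "English", "lang_code": "en",
      "pos": "noun", "senses": [{"glosses": ["A representative case."]}]}
redirect = {"title": "colour", "redirect": "color", "pos": "hard-redirect"}

print(check_entry(ok))                 # []
print(check_entry(redirect))           # []
print(check_entry({"word": "broken"}))
# ['missing field: lang', 'missing field: lang_code', 'missing field: pos', 'no senses']
```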
```python
# Complete field reference for word entries
word_entry = {
    # Required fields
    "word": "example",   # The headword
    "lang": "English",   # Language name
    "lang_code": "en",   # ISO language code
    "pos": "noun",       # Part of speech

    # Senses (at least one required)
    "senses": [
        {
            "glosses": ["Definition text"],
            "raw_glosses": ["Definition with qualifiers"],
            "tags": ["formal", "archaic"],
            "categories": ["English nouns"],
            "topics": ["linguistics"],
            "examples": [
                {
                    "text": "Example sentence",
                    "ref": "Source reference",
                    "english": "Translation if non-English",
                    "type": "quotation",  # or "example"
                }
            ],
            "synonyms": [{"word": "synonym", "tags": ["informal"]}],
            "antonyms": [{"word": "antonym"}],
            "hypernyms": [{"word": "hypernym"}],
            "hyponyms": [{"word": "hyponym"}],
            "meronyms": [{"word": "part"}],
            "holonyms": [{"word": "whole"}],
            "coordinate_terms": [{"word": "sibling"}],
            "derived": [{"word": "derived_word"}],
            "related": [{"word": "related_word"}],
            "alt_of": [{"word": "main_form", "extra": "notes"}],
            "form_of": [{"word": "lemma", "extra": "inflection type"}],
            "wikidata": ["Q12345"],
            "wikipedia": ["Article_name"],
        }
    ],

    # Pronunciation
    "sounds": [
        {"ipa": "/ɪɡˈzæmpəl/", "tags": ["UK"]},
        {"ipa": "[ɪɡˈzæmpəɫ]", "tags": ["US"]},
        {"enpr": "ĭg-zăm′-pəl"},
        {"audio": "en-us-example.ogg", "ogg_url": "https://...",
         "mp3_url": "https://..."},
        {"rhymes": "-æmpəl"},
        {"homophones": ["homophone1"]},
        {"hyphenation": ["ex", "am", "ple"]},
    ],

    # Forms/inflections
    "forms": [
        {"form": "examples", "tags": ["plural"]},
        {"form": "exampled", "tags": ["past"]},
    ],

    # Etymology
    "etymology_text": "From Latin exemplum...",
    "etymology_number": 1,  # If multiple etymologies
    "etymology_templates": [
        {"name": "der", "args": {"1": "en", "2": "la", "3": "exemplum"},
         "expansion": "Latin exemplum"}
    ],

    # Descendants
    "descendants": [
        {"depth": 1, "text": "French: exemple",
         "templates": [{"name": "desc", "args": {...}}]}
    ],

    # Translations (usually on English entries)
    "translations": [
        {"lang": "French", "lang_code": "fr", "word": "exemple",
         "sense": "something representative", "tags": ["masculine"]}
    ],

    # Categories and metadata
    "categories": ["English nouns", "English terms derived from Latin"],
    "topics": ["linguistics", "grammar"],
    "wikidata": ["Q12345"],
    "wikipedia": ["Example"],

    # Templates (for advanced processing)
    "head_templates": [{"name": "en-noun", "args": {}, "expansion": "..."}],
    "inflection_templates": [{"name": "en-noun", "args": {}}],
}

# Redirect entry structure
redirect_entry = {
    "title": "colour",
    "redirect": "color",
    "pos": "hard-redirect",
}
```

## Summary

Wiktextract is a comprehensive solution for extracting machine-readable lexical data from Wiktionary. Its primary use cases include building multilingual dictionaries for NLP applications, creating training data for machine translation systems, constructing knowledge graphs of word relationships (synonyms, antonyms, hypernyms), generating pronunciation databases with IPA and audio, and building morphological analyzers from inflection data. The tool excels at producing research-quality lexical resources that capture the full richness of Wiktionary's collaborative dictionary data.

Integration patterns typically involve either batch processing via the `wiktwords` CLI for large-scale data extraction, or programmatic use through the Python API for custom pipelines. For most users, downloading pre-extracted data from [kaikki.org](https://kaikki.org/dictionary/) is the fastest path to usable data. For custom extractions or specific language combinations, running the tool on fresh Wiktionary dumps (available from [dumps.wikimedia.org](https://dumps.wikimedia.org)) with appropriate configuration provides maximum flexibility. The JSONL output format enables efficient streaming processing and integration with standard data pipelines, while the optional human-readable JSON mode facilitates debugging and manual inspection of extracted entries.
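On the streaming point above: published extracts are often gzip-compressed JSONL, which can be filtered without ever holding the whole file in memory. A standard-library sketch; the demo writes and reads a tiny throwaway file, and in practice you would pass the path of whatever `wiktwords` output or kaikki.org download you have:

```python
import gzip
import json
import os
import tempfile

def stream_jsonl_gz(path):
    """Yield entries one at a time from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Demo: write a tiny compressed JSONL file, then stream it back
demo = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"word": "thrill", "lang_code": "en", "pos": "verb"}) + "\n")
    f.write(json.dumps({"word": "Wort", "lang_code": "de", "pos": "noun"}) + "\n")

# Filter for English entries while streaming
english = [e["word"] for e in stream_jsonl_gz(demo) if e["lang_code"] == "en"]
print(english)  # ['thrill']
```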