Repository: https://github.com/tatuylonen/wiktextract
# Wiktextract

Wiktextract is a Python package and utility for extracting structured, machine-readable data from Wiktionary dump files. It parses XML dump files from various Wiktionary language editions (most notably English Wiktionary) and outputs comprehensive lexical data in JSONL format, including word definitions, parts of speech, translations, pronunciations (with IPA and audio links), etymologies, inflection tables, synonyms, antonyms, and other linguistic relationships. The tool stands out by expanding Wiktionary's templates and Lua macros during extraction, enabling superior accuracy and completeness compared to simple text-parsing approaches.

The package supports extraction from multiple Wiktionary editions, including English, German, French, Spanish, Chinese, Japanese, Korean, and many others. It provides both a command-line interface (`wiktwords`) for straightforward batch processing and a Python API for programmatic integration. The extracted data is useful for natural language processing, machine translation, semantic parsing, language generation applications, and building multilingual dictionaries with declension/conjugation information.

## WiktionaryConfig

The `WiktionaryConfig` class controls what data to capture during extraction and collects statistics. It specifies which languages to process, which data types to extract (translations, pronunciations, etymologies, etc.), and maintains error/warning logs during processing.

```python
from wiktextract import WiktionaryConfig

# Create a configuration that extracts all data types for English,
# Translingual, and German entries
config = WiktionaryConfig(
    dump_file_lang_code="en",                    # Wiktionary edition language code
    capture_language_codes=["en", "mul", "de"],  # Languages to extract (None = all)
    capture_translations=True,                   # Extract translation data
    capture_pronunciation=True,                  # Extract IPA, audio files, etc.
    capture_linkages=True,                       # Extract synonyms, antonyms, etc.
    capture_compounds=True,                      # Extract compound words
    capture_redirects=True,                      # Extract redirect pages
    capture_examples=True,                       # Extract usage examples
    capture_etymologies=True,                    # Extract etymology information
    capture_inflections=True,                    # Extract inflection tables
    capture_descendants=True,                    # Extract descendant words
    verbose=False,                               # Enable verbose logging
    expand_tables=False,                         # Expand inflection tables to text
)

# Access statistics after processing
print(f"Pages processed: {config.num_pages}")
print(f"Languages found: {dict(config.language_counts)}")
print(f"Parts of speech: {dict(config.pos_counts)}")
print(f"Errors: {len(config.errors)}")
print(f"Warnings: {len(config.warnings)}")
```

## WiktextractContext

The `WiktextractContext` class is the main processing context; it combines the wikitextprocessor `Wtp` context with a `WiktionaryConfig`. It manages database connections and thesaurus data, and provides the environment for page parsing operations.

```python
from wiktextract import WiktextractContext, WiktionaryConfig
from wikitextprocessor import Wtp

# Create the wikitextprocessor context
wtp = Wtp(
    db_path="wiktionary.db",  # SQLite database path for caching
    lang_code="en",           # Wiktionary edition language
    quiet=False,              # Show progress messages
)

# Create wiktextract configuration
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=None,  # Capture all languages
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
)

# Create the extraction context
wxr = WiktextractContext(wtp, config)

# Context provides access to current processing state
wxr.lang = "English"  # Current language being processed
wxr.word = "example"  # Current word being processed
wxr.pos = "noun"      # Current part of speech

# Reconnect databases (useful in multiprocessing scenarios)
wxr.reconnect_databases(check_same_thread=True)

# Clean up before passing to worker processes
wxr.remove_unpicklable_objects()
```

## parse_wiktionary

The `parse_wiktionary` function is the primary entry point for processing entire Wiktionary dump files. It handles the two-phase extraction process: first parsing templates and modules, then extracting word data in parallel.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, parse_wiktionary
from wikitextprocessor import Wtp

# Set up the extraction context
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
)
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
wxr = WiktextractContext(wtp, config)

# Define namespace IDs to process (Main=0, Template=10, Module=828)
namespace_ids = {
    wxr.wtp.NAMESPACE_DATA.get(name, {}).get("id", 0)
    for name in ["Main", "Template", "Module"]
}

# Process the dump file and write to output
with open("output.jsonl", "w", encoding="utf-8") as out_f:
    parse_wiktionary(
        wxr=wxr,
        dump_path="enwiktionary-20240101-pages-articles.xml.bz2",
        num_processes=8,              # Parallel processes (None = auto)
        phase1_only=False,            # False = full extraction
        namespace_ids=namespace_ids,  # Namespaces to process
        out_f=out_f,                  # Output file object
        human_readable=False,         # True = pretty-printed JSON
        override_folders=None,        # Override page folders
        skip_extract_dump=False,      # Skip if DB exists
        save_pages_path=None,         # Save extracted pages
    )

# Output: JSONL file with one JSON object per line
# {"word": "example", "lang": "English", "lang_code": "en", "pos": "noun", ...}
```

## reprocess_wiktionary

The `reprocess_wiktionary` function processes pages from an existing SQLite database without re-parsing the dump file. This is useful for re-extracting data with different settings or for debugging specific pages.
```python
from wiktextract import WiktextractContext, WiktionaryConfig, reprocess_wiktionary
from wikitextprocessor import Wtp

# Load from existing database
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_translations=True,
    capture_pronunciation=True,
)
wxr = WiktextractContext(wtp, config)

with open("reprocessed.jsonl", "w", encoding="utf-8") as out_f:
    reprocess_wiktionary(
        wxr=wxr,
        num_processes=4,
        out_f=out_f,
        human_readable=False,
        search_pattern=None,  # Optional: filter pages by pattern
    )

# Using search_pattern to filter pages
# search_pattern uses SQL LIKE syntax: % = any chars, _ = single char
with open("english_only.jsonl", "w", encoding="utf-8") as out_f:
    reprocess_wiktionary(
        wxr=wxr,
        num_processes=4,
        out_f=out_f,
        human_readable=True,
        search_pattern="%==English==%",  # Only pages with English section
    )
```

## parse_page

The `parse_page` function extracts data from a single Wiktionary page given its title and wikitext content. This is useful for testing, debugging, or processing individual pages.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, parse_page
from wikitextprocessor import Wtp

# Set up context
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_examples=True,
    capture_etymologies=True,
)
wxr = WiktextractContext(wtp, config)

# Example Wiktionary page content
page_text = """
==English==

===Etymology===
From {{inh|en|enm|word}}, from {{inh|en|ang|word}}.

===Pronunciation===
* {{IPA|en|/wɜːd/}}
* {{audio|en|en-us-word.ogg|Audio (US)}}

===Noun===
{{en-noun}}

# A unit of language.
#: {{ux|en|Be careful with your '''words'''.}}
# A promise.
#: {{syn|en|promise|vow}}

===Verb===
{{en-verb}}

# To say or write using words.

====Synonyms====
* {{l|en|express}}
* {{l|en|phrase}}
"""

# Parse the page
results = parse_page(wxr, "word", page_text)

# Results is a list of dictionaries, one per word/POS combination
for entry in results:
    print(f"Word: {entry.get('word')}")
    print(f"Language: {entry.get('lang')}")
    print(f"POS: {entry.get('pos')}")
    print(f"Senses: {len(entry.get('senses', []))}")
    for sense in entry.get('senses', []):
        print(f"  - {sense.get('glosses', [''])[0]}")
    print(f"Sounds: {entry.get('sounds', [])}")
    print(f"Etymology: {entry.get('etymology_text', '')[:100]}...")
    print("---")

# Example output structure:
# {
#   "word": "word",
#   "lang": "English",
#   "lang_code": "en",
#   "pos": "noun",
#   "etymology_text": "From Middle English word, from Old English word.",
#   "sounds": [{"ipa": "/wɜːd/"}, {"audio": "en-us-word.ogg", ...}],
#   "senses": [
#     {"glosses": ["A unit of language."], "examples": [...]},
#     {"glosses": ["A promise."], "synonyms": [...]}
#   ],
#   "forms": [{"form": "words", "tags": ["plural"]}]
# }
```

## extract_thesaurus_data

The `extract_thesaurus_data` function processes Wiktionary's Thesaurus namespace pages to extract synonym, antonym, and other word-relationship data. The extracted data is stored in a SQLite database and can be merged into main word entries.
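The merge into main word entries mentioned above is easy to sketch in plain Python, since entries and thesaurus rows are ordinary dicts. In this sketch the entry and rows are synthetic stand-ins, and `merge_synonyms` is an illustrative helper under assumed field names, not a wiktextract API:

```python
def merge_synonyms(entry, thesaurus_rows):
    """Merge thesaurus-derived synonym terms into a word entry dict,
    skipping terms already present. Returns the modified entry."""
    existing = {s["word"] for s in entry.setdefault("synonyms", [])}
    for row in thesaurus_rows:
        if row["linkage"] == "synonyms" and row["term"] not in existing:
            # "source" marks the term as coming from a Thesaurus page
            entry["synonyms"].append({"word": row["term"], "source": "Thesaurus"})
            existing.add(row["term"])
    return entry

entry = {"word": "happy", "pos": "adj", "synonyms": [{"word": "glad"}]}
rows = [
    {"linkage": "synonyms", "term": "content"},
    {"linkage": "synonyms", "term": "glad"},  # duplicate, skipped
    {"linkage": "antonyms", "term": "sad"},   # wrong linkage type, skipped
]
merged = merge_synonyms(entry, rows)
print([s["word"] for s in merged["synonyms"]])  # ['glad', 'content']
```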
```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_thesaurus_data
from wikitextprocessor import Wtp

# Enable thesaurus extraction in config
config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en"],
    capture_linkages=True,
)
config.extract_thesaurus_pages = True  # Enable thesaurus processing

wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract thesaurus data (requires dump to be already processed)
extract_thesaurus_data(wxr, num_processes=4)

# Query thesaurus data programmatically
from wiktextract.thesaurus import search_thesaurus

# Find synonyms for "happy" as an adjective
for term in search_thesaurus(
    wxr.thesaurus_db_conn,
    entry="happy",
    lang_code="en",
    pos="adj",
    linkage_type="synonyms",  # Optional: filter by linkage type
):
    print(f"Term: {term.term}")
    print(f"Linkage: {term.linkage}")
    print(f"Tags: {term.tags}")
    print(f"Sense: {term.sense}")

# ThesaurusTerm dataclass fields:
# - entry: str           # Main word entry
# - language_code: str   # Language code
# - pos: str             # Part of speech
# - linkage: str         # Relationship type (synonyms, antonyms, etc.)
# - term: str            # Related term
# - tags: list[str]      # Qualifier tags
# - raw_tags: list[str]  # Unparsed tags
# - topics: list[str]    # Topic categories
# - roman: str           # Romanization
# - sense: str           # Sense description
```

## extract_categories

The `extract_categories` function extracts the Wiktionary category tree hierarchy by evaluating Lua modules. This provides structured access to the category relationships defined in Wiktionary.
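Because the returned tree is a plain nested dict, it can be traversed with ordinary recursion. A sketch that enumerates root-to-leaf paths over a tiny synthetic tree in the documented `roots`/`nodes`/`children` shape; real category data can contain cycles, so production code should also track visited nodes:

```python
def iter_category_paths(tree, name, path=()):
    """Yield root-to-leaf name paths through an
    extract_categories-style tree."""
    path = path + (name,)
    children = tree["nodes"].get(name, {}).get("children", [])
    if not children:
        yield path
        return
    for child in children:
        yield from iter_category_paths(tree, child, path)

# Tiny synthetic tree following the documented shape
tree = {
    "roots": ["Fundamental"],
    "nodes": {
        "Fundamental": {"children": ["Emotions"]},
        "Emotions": {"children": ["Joy", "Fear"]},
        "Joy": {},
        "Fear": {},
    },
}
paths = [p for root in tree["roots"] for p in iter_category_paths(tree, root)]
for p in paths:
    print(" > ".join(p))
# Fundamental > Emotions > Joy
# Fundamental > Emotions > Fear
```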
```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_categories
from wikitextprocessor import Wtp
import json

# Set up context (requires processed dump with Lua modules)
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(dump_file_lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract category tree
category_tree = extract_categories(wxr)

# Structure of returned data:
# {
#   "roots": ["Fundamental", ...],  # Top-level categories
#   "nodes": {
#     "category name": {
#       "name": "Category Name",
#       "desc": "Description with {{templates}}",
#       "clean_desc": "Cleaned description text",
#       "children": ["child1", "child2"],
#       "sort": ["sort key 1", "sort key 2"]
#     }
#   }
# }

# Save to file
with open("categories.json", "w", encoding="utf-8") as f:
    json.dump(category_tree, f, indent=2, sort_keys=True)

# Access category information
print(f"Root categories: {category_tree['roots']}")
print(f"Total categories: {len(category_tree['nodes'])}")

# Example: find all children of "Emotions" category
emotions = category_tree['nodes'].get('emotions', {})
print(f"Emotions subcategories: {emotions.get('children', [])}")
```

## extract_namespace

The `extract_namespace` function exports all pages from a specific namespace (like Template or Module) to a tar archive. This is useful for backing up or analyzing Wiktionary's template/module infrastructure.

```python
from wiktextract import WiktextractContext, WiktionaryConfig, extract_namespace
from wikitextprocessor import Wtp

# Set up context
wtp = Wtp(db_path="enwiktionary.db", lang_code="en")
config = WiktionaryConfig(dump_file_lang_code="en")
wxr = WiktextractContext(wtp, config)

# Extract all templates to a tar file
extract_namespace(wxr, "Template", "templates.tar")

# Extract all Lua modules to a tar file
extract_namespace(wxr, "Module", "modules.tar")

# The tar file structure:
# templates.tar/
#   Template/
#     en-noun.txt
#     IPA.txt
#     ...
#
# Each file contains the wikitext source of the template/module
```

## Command-Line Interface (wiktwords)

The `wiktwords` command provides a comprehensive CLI for extracting data from Wiktionary dumps without writing Python code.

```bash
# Basic extraction for all languages with all data types
wiktwords --all --all-languages --out data.jsonl \
    --edition en enwiktionary-20240101-pages-articles.xml.bz2

# Extract only English and German with specific data types
wiktwords --language-code en --language-code de \
    --translations --pronunciations --linkages \
    --out english_german.jsonl --edition en \
    enwiktionary-20240101-pages-articles.xml.bz2

# Create database for faster subsequent processing
wiktwords --db-path enwikt.db --edition en \
    enwiktionary-20240101-pages-articles.xml.bz2

# Process single page for debugging (with existing database)
wiktwords --db-path enwikt.db --edition en --all --all-languages \
    --out word_debug.json --page "word" --human-readable

# Process page from file
# File format: first line "TITLE: page_title", rest is wikitext
wiktwords --db-path enwikt.db --edition en --all \
    --out test.json --page test_page.txt --human-readable

# Extract with parallel processing control
wiktwords --all --all-languages --num-processes 16 \
    --out data.jsonl --edition en dump.xml.bz2

# Extract category tree, templates, and modules
wiktwords --db-path enwikt.db --edition en \
    --categories-file categories.json \
    --templates-file templates.tar \
    --modules-file modules.tar \
    --skip-extraction dump.xml.bz2

# Filter pages by pattern (with existing database)
wiktwords --db-path enwikt.db --edition en --all \
    --search-pattern "%==English==%==Noun==%" \
    --out english_nouns.jsonl

# Run extraction for non-English Wiktionary
wiktwords --all --all-languages --out fr_data.jsonl \
    --edition fr frwiktionary-20240101-pages-articles.xml.bz2

# Using container (Podman/Docker)
podman run -v /data:/data -it --rm ghcr.io/tatuylonen/wiktextract \
    --all --all-languages --out /data/output.jsonl \
    --edition en /data/enwiktionary-20240101-pages-articles.xml.bz2
```

## Reading Extracted JSONL Data

The extracted data is in JSONL format (one JSON object per line). Here are examples of reading and processing the data.

```python
import json
import re
from collections import defaultdict

# Read JSONL file line by line (memory efficient)
def read_wiktextract_data(filename):
    with open(filename, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count entries by language
language_counts = defaultdict(int)
for entry in read_wiktextract_data("data.jsonl"):
    lang = entry.get("lang", "Unknown")
    language_counts[lang] += 1
print(dict(language_counts))

# Example: extract all English nouns with their definitions
english_nouns = []
for entry in read_wiktextract_data("data.jsonl"):
    if entry.get("lang_code") == "en" and entry.get("pos") == "noun":
        word_data = {
            "word": entry["word"],
            "definitions": [
                sense.get("glosses", [""])[0]
                for sense in entry.get("senses", [])
            ],
            "forms": entry.get("forms", []),
        }
        english_nouns.append(word_data)

# Example: find words with specific pronunciation
def find_by_ipa(data_file, ipa_pattern):
    pattern = re.compile(ipa_pattern)
    for entry in read_wiktextract_data(data_file):
        for sound in entry.get("sounds", []):
            if "ipa" in sound and pattern.search(sound["ipa"]):
                yield entry
                break

# Find words with /θ/ sound
for entry in find_by_ipa("data.jsonl", r"/.*θ.*"):
    print(f"{entry['word']}: {entry.get('sounds', [])}")

# Example: build translation dictionary
translations = defaultdict(list)
for entry in read_wiktextract_data("data.jsonl"):
    if entry.get("lang_code") != "en":
        continue
    word = entry["word"]
    for trans in entry.get("translations", []):
        if trans.get("lang_code") == "de":  # German translations
            translations[word].append({
                "german": trans.get("word"),
                "sense": trans.get("sense", ""),
            })

# Pretty-print a single entry for inspection
def pretty_print_entry(entry):
    print(json.dumps(entry, indent=2, sort_keys=True, ensure_ascii=False))

# Example entry structure:
example_entry = {
    "word": "thrill",
    "lang": "English",
    "lang_code": "en",
    "pos": "verb",
    "etymology_text": "From Middle English thrillen...",
    "etymology_templates": [
        {"name": "inh", "args": {"1": "en", "2": "enm", "3": "thrillen"},
         "expansion": "Middle English thrillen"}
    ],
    "sounds": [
        {"ipa": "/θɹɪl/"},
        {"audio": "en-us-thrill.ogg", "mp3_url": "...", "ogg_url": "..."}
    ],
    "forms": [
        {"form": "thrills", "tags": ["present", "singular", "third-person"]},
        {"form": "thrilling", "tags": ["present", "participle"]},
        {"form": "thrilled", "tags": ["past", "participle"]}
    ],
    "senses": [
        {
            "glosses": ["To suddenly excite someone..."],
            "tags": ["ergative", "figuratively"],
            "examples": [{"text": "The movie thrilled audiences."}],
            "synonyms": [{"word": "excite"}, {"word": "electrify"}]
        }
    ],
    "translations": [
        {"lang": "German", "lang_code": "de", "word": "begeistern",
         "sense": "to excite"}
    ]
}
```

## Data Structure Reference

The extracted JSON entries follow a consistent structure with these key fields.
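Since optional fields vary by edition and extraction settings, a light sanity check against the required fields can catch malformed records early. `check_entry` below is an illustrative helper, not part of wiktextract, and its rules are assumptions drawn from the field reference:

```python
REQUIRED_FIELDS = ("word", "lang", "lang_code", "pos")

def check_entry(entry):
    """Return a list of problems for a word entry; empty list means OK.
    Redirect entries (which carry a 'redirect' key) are exempt."""
    if "redirect" in entry:
        return []
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if not entry.get("senses"):
        problems.append("no senses")
    return problems

ok = {"word": "example", "lang": "English", "lang_code": "en",
      "pos": "noun", "senses": [{"glosses": ["A representative case."]}]}
redirect = {"title": "colour", "redirect": "color", "pos": "hard-redirect"}

print(check_entry(ok))                 # []
print(check_entry(redirect))           # []
print(check_entry({"word": "broken"}))
# ['missing field: lang', 'missing field: lang_code', 'missing field: pos', 'no senses']
```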
```python
# Complete field reference for word entries
word_entry = {
    # Required fields
    "word": "example",   # The headword
    "lang": "English",   # Language name
    "lang_code": "en",   # ISO language code
    "pos": "noun",       # Part of speech

    # Senses (at least one required)
    "senses": [
        {
            "glosses": ["Definition text"],
            "raw_glosses": ["Definition with qualifiers"],
            "tags": ["formal", "archaic"],
            "categories": ["English nouns"],
            "topics": ["linguistics"],
            "examples": [
                {
                    "text": "Example sentence",
                    "ref": "Source reference",
                    "english": "Translation if non-English",
                    "type": "quotation",  # or "example"
                }
            ],
            "synonyms": [{"word": "synonym", "tags": ["informal"]}],
            "antonyms": [{"word": "antonym"}],
            "hypernyms": [{"word": "hypernym"}],
            "hyponyms": [{"word": "hyponym"}],
            "meronyms": [{"word": "part"}],
            "holonyms": [{"word": "whole"}],
            "coordinate_terms": [{"word": "sibling"}],
            "derived": [{"word": "derived_word"}],
            "related": [{"word": "related_word"}],
            "alt_of": [{"word": "main_form", "extra": "notes"}],
            "form_of": [{"word": "lemma", "extra": "inflection type"}],
            "wikidata": ["Q12345"],
            "wikipedia": ["Article_name"],
        }
    ],

    # Pronunciation
    "sounds": [
        {"ipa": "/ɪɡˈzæmpəl/", "tags": ["UK"]},
        {"ipa": "[ɪɡˈzæmpəɫ]", "tags": ["US"]},
        {"enpr": "ĭg-zăm′-pəl"},
        {"audio": "en-us-example.ogg", "ogg_url": "https://...",
         "mp3_url": "https://..."},
        {"rhymes": "-æmpəl"},
        {"homophones": ["homophone1"]},
        {"hyphenation": ["ex", "am", "ple"]},
    ],

    # Forms/inflections
    "forms": [
        {"form": "examples", "tags": ["plural"]},
        {"form": "exampled", "tags": ["past"]},
    ],

    # Etymology
    "etymology_text": "From Latin exemplum...",
    "etymology_number": 1,  # If multiple etymologies
    "etymology_templates": [
        {"name": "der", "args": {"1": "en", "2": "la", "3": "exemplum"},
         "expansion": "Latin exemplum"}
    ],

    # Descendants
    "descendants": [
        {"depth": 1, "text": "French: exemple",
         "templates": [{"name": "desc", "args": {...}}]}
    ],

    # Translations (usually on English entries)
    "translations": [
        {"lang": "French", "lang_code": "fr", "word": "exemple",
         "sense": "something representative", "tags": ["masculine"]}
    ],

    # Categories and metadata
    "categories": ["English nouns", "English terms derived from Latin"],
    "topics": ["linguistics", "grammar"],
    "wikidata": ["Q12345"],
    "wikipedia": ["Example"],

    # Templates (for advanced processing)
    "head_templates": [{"name": "en-noun", "args": {}, "expansion": "..."}],
    "inflection_templates": [{"name": "en-noun", "args": {}}],
}

# Redirect entry structure
redirect_entry = {
    "title": "colour",
    "redirect": "color",
    "pos": "hard-redirect",
}
```

## Summary

Wiktextract is a comprehensive solution for extracting machine-readable lexical data from Wiktionary. Its primary use cases include building multilingual dictionaries for NLP applications, creating training data for machine translation systems, constructing knowledge graphs of word relationships (synonyms, antonyms, hypernyms), generating pronunciation databases with IPA and audio, and building morphological analyzers from inflection data. The tool excels at producing research-quality lexical resources that capture the full richness of Wiktionary's collaborative dictionary data.

Integration patterns typically involve either batch processing via the `wiktwords` CLI for large-scale data extraction, or programmatic use through the Python API for custom pipelines. For most users, downloading pre-extracted data from [kaikki.org](https://kaikki.org/dictionary/) is the fastest path to usable data. For custom extractions or specific language combinations, running the tool on fresh Wiktionary dumps (available from [dumps.wikimedia.org](https://dumps.wikimedia.org)) with appropriate configuration provides maximum flexibility. The JSONL output format enables efficient streaming processing and integration with standard data pipelines, while the optional human-readable JSON mode facilitates debugging and manual inspection of extracted entries.
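On the streaming point above: published extracts are often gzip-compressed JSONL, which can be filtered without ever holding the whole file in memory. A standard-library sketch; the demo writes and reads a tiny throwaway file, and in practice you would pass the path of whatever `wiktwords` output or kaikki.org download you have:

```python
import gzip
import json
import os
import tempfile

def stream_jsonl_gz(path):
    """Yield entries one at a time from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Demo: write a tiny compressed JSONL file, then stream it back
demo = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"word": "thrill", "lang_code": "en", "pos": "verb"}) + "\n")
    f.write(json.dumps({"word": "Wort", "lang_code": "de", "pos": "noun"}) + "\n")

# Filter for English entries while streaming
english = [e["word"] for e in stream_jsonl_gz(demo) if e["lang_code"] == "en"]
print(english)  # ['thrill']
```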