### Install cmudict Package

Source: https://github.com/prosegrinder/python-cmudict/blob/main/README.md

Install the cmudict package using pip. This is the standard method for installing Python packages.

```bash
pip install cmudict
```

--------------------------------

### Get Punctuation Pronunciation Dictionary with cmudict.vp()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns a dictionary mapping 52 punctuation tokens to their ARPAbet pronunciations. Useful for handling punctuation in text processing.

```python
import cmudict

vp = cmudict.vp()
print(len(vp))  # 52
print(vp["!exclamation-point"])
# [['EH2', 'K', 'S', 'K', 'L', 'AH0', 'M', 'EY1', 'SH', 'AH0', 'N', 'P', 'OY2', 'N', 'T']]
print(vp[",comma"])
# [['K', 'AA1', 'M', 'AH0']]

# List all punctuation tokens
for token in sorted(vp.keys()):
    print(token)
# !exclamation-point
# "close-quote
# ...
```

--------------------------------

### Get CMUdict file content as a string

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Retrieve the full raw text of cmudict.dict as a string. This is identical to cmudict.raw().

```python
import cmudict

s = cmudict.dict_string()
print(len(s))  # 3618488
print(s[:80])
```

--------------------------------

### Get CMUdict entries as (word, phones) tuples

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Retrieve the CMU lexicon as a flat list of (word, phones) tuples. Preserves all entries, including multiple pronunciations for the same word. Compatible with NLTK's CMUDictCorpusReader.entries().

```python
import cmudict

entries = cmudict.entries()
print(len(entries))  # 135166

# First five entries
for word, phones in entries[:5]:
    print(word, "->", " ".join(phones))
# a -> AH0
# a(1) -> EY1
# a's -> EY1 Z
# a. -> EY1
# a.'s -> EY1 Z

# Count words with more than one pronunciation
from collections import Counter

word_counts = Counter(word for word, _ in entries)
multi_pron = [w for w, c in word_counts.items() if c > 1]
print(f"{len(multi_pron)} words have multiple pronunciations")
# e.g. 9114 words have multiple pronunciations
```

--------------------------------

### Get all words from CMUdict lexicon

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Obtain a flat list of all lowercase word strings from the CMU lexicon. Words with multiple pronunciations appear multiple times. Compatible with NLTK's CMUDictCorpusReader.words().

```python
import cmudict

words = cmudict.words()
print(len(words))  # 135166
print(words[:5])   # ['a', 'a(1)', "a's", 'a.', "a.'s"]

# Check if a word is in the lexicon
word_set = set(words)
print("hello" in word_set)   # True
print("foobar" in word_set)  # False
```

--------------------------------

### Retrieve CMUdict License

Source: https://github.com/prosegrinder/python-cmudict/blob/main/README.md

Get the license for the CMUdict data set as a string. This function is useful for understanding the terms of use for the dictionary data.

```python
import cmudict

cmudict.license_string()  # Returns the cmudict license as a string
```

--------------------------------

### Access CMUdict Data Files

Source: https://github.com/prosegrinder/python-cmudict/blob/main/README.md

Import the cmudict library and access its data files. Functions are provided to get raw string content, binary streams, or minimally processed structures for the dictionary, phones, symbols, and punctuation pronunciation (vp) data.
```python
import cmudict

cmudict.dict()  # Compatible with NLTK
cmudict.dict_string()
cmudict.dict_stream()
cmudict.phones()
cmudict.phones_string()
cmudict.phones_stream()
cmudict.symbols()
cmudict.symbols_string()
cmudict.symbols_stream()
cmudict.vp()
cmudict.vp_string()
cmudict.vp_stream()
```

--------------------------------

### Access CMUdict as a Python dictionary

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Get the full CMU pronouncing dictionary as a Python dict. Keys are lowercase words, and values are lists of pronunciations. Compatible with NLTK's CMUDictCorpusReader.dict().

```python
import cmudict

d = cmudict.dict()

# Single-pronunciation word
print(d["hello"])
# [['HH', 'AH0', 'L', 'OW1']]

# Word with multiple pronunciations
print(d["spieth"])
# [['S', 'P', 'IY1', 'TH'], ['S', 'P', 'AY1', 'AH0', 'TH']]

# Case-insensitive lookup that raises a descriptive KeyError
def get_pronunciation(word: str) -> list[list[str]]:
    pronunciations = d.get(word.lower())
    if pronunciations is None:
        raise KeyError(f"'{word}' not found in CMUdict ({len(d)} entries)")
    return pronunciations

print(get_pronunciation("Python"))
# [['P', 'AY1', 'TH', 'AH0', 'N']]
```

--------------------------------

### Get Parsed Phone Table with cmudict.phones()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Retrieves the 39 ARPAbet phones used in the dictionary. Useful for analyzing phonetic categories or filtering specific types of phones like vowels.

```python
import cmudict

phones = cmudict.phones()
print(len(phones))  # 39

for phone, categories in phones[:5]:
    print(phone, "->", categories)
# AA -> ['vowel']
# AE -> ['vowel']
# AH -> ['vowel']
# AO -> ['vowel']
# AW -> ['vowel']

# Get all vowel phones
vowels = [p for p, cats in phones if "vowel" in cats]
print("Vowels:", vowels)
# Vowels: ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER', 'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW']
```

--------------------------------

### Read raw CMUdict file content as a string

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Get the complete cmudict.dict file as a single string, including comment lines.
Useful for regex-based processing or passing raw content to other tools. Compatible with NLTK's CMUDictCorpusReader.raw().

```python
import cmudict

raw = cmudict.raw()
print(type(raw))  # <class 'str'>
print(len(raw))   # 3618488

# Print the first three non-comment lines
lines = [l for l in raw.splitlines() if not l.startswith(";")]
for line in lines[:3]:
    print(line)
# A AH0
# A(1) EY1
# A'S EY1 Z
```

--------------------------------

### cmudict.vp_stream() / cmudict.vp_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the cmudict.vp file as a binary stream or as a string respectively.

```APIDOC
## `cmudict.vp_stream()` / `cmudict.vp_string()`: Raw VP data

Returns the `cmudict.vp` file as a binary stream or as a string respectively.

### Usage Example (String)

```python
import cmudict

s = cmudict.vp_string()
print(len(s))  # 1747
```

### Usage Example (Stream)

```python
import cmudict

with cmudict.vp_stream() as stream:
    for line in stream:
        print(line.decode("utf-8").strip())
        break
# !exclamation-point EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
```
```

--------------------------------

### Access Raw VP Data with cmudict.vp_stream() / cmudict.vp_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Provides the cmudict.vp file content as a binary stream or a string. Useful for direct processing of punctuation pronunciation data.

```python
import cmudict

s = cmudict.vp_string()
print(len(s))  # 1747

with cmudict.vp_stream() as stream:
    for line in stream:
        print(line.decode("utf-8").strip())
        break
# !exclamation-point EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
```

--------------------------------

### cmudict.vp()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the cmudict.vp file as a dict mapping 52 punctuation tokens to their ARPAbet pronunciations.
```APIDOC
## `cmudict.vp()`: Punctuation pronunciation dictionary

Returns the `cmudict.vp` file as a `dict` mapping 52 punctuation tokens (e.g., `"!exclamation-point"`, `",comma"`) to their ARPAbet pronunciations, in the same structure as `cmudict.dict()`.

### Usage Example

```python
import cmudict

vp = cmudict.vp()
print(len(vp))  # 52
print(vp["!exclamation-point"])
# [['EH2', 'K', 'S', 'K', 'L', 'AH0', 'M', 'EY1', 'SH', 'AH0', 'N', 'P', 'OY2', 'N', 'T']]
print(vp[",comma"])
# [['K', 'AA1', 'M', 'AH0']]

# List all punctuation tokens
for token in sorted(vp.keys()):
    print(token)
# !exclamation-point
# "close-quote
# ...
```
```

--------------------------------

### cmudict.license_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the full text of the CMU Pronouncing Dictionary data license as a string.

```APIDOC
## `cmudict.license_string()`: CMUdict data license text

Returns the full text of the CMU Pronouncing Dictionary data license as a string, useful for attribution or display in downstream applications.

### Usage Example

```python
import cmudict

license_text = cmudict.license_string()
print(len(license_text))  # 1754
print(license_text[:200])
# CMUdict
# -------
# Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
```
```

--------------------------------

### Access Raw Phone Data with cmudict.phones_stream() / cmudict.phones_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Provides the cmudict.phones file content as a binary stream or a string. Useful for direct processing or when the full parsed table is not needed.
```python
import cmudict

# As string (382 characters)
s = cmudict.phones_string()
print(len(s))  # 382
print(s[:60])
# AA vowel
# AE vowel
# AH vowel

# As stream (useful for streaming pipelines)
with cmudict.phones_stream() as stream:
    for line in stream:
        phone, ptype = line.decode("utf-8").strip().split()
        # process each phone inline
        pass
```

--------------------------------

### cmudict.phones_stream() / cmudict.phones_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the cmudict.phones file as a binary stream or as a string respectively.

```APIDOC
## `cmudict.phones_stream()` / `cmudict.phones_string()`: Raw phone data

Returns the `cmudict.phones` file as a binary stream or as a string respectively.

### Usage Example (String)

```python
import cmudict

s = cmudict.phones_string()
print(len(s))  # 382
print(s[:60])
# AA\tvowel
# AE\tvowel
# AH\tvowel
```

### Usage Example (Stream)

```python
import cmudict

with cmudict.phones_stream() as stream:
    for line in stream:
        phone, ptype = line.decode("utf-8").strip().split()
        # process each phone inline
        pass
```
```

--------------------------------

### cmudict.phones()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the 39 ARPAbet phones used in the dictionary as a list of (phone, [category]) tuples.

```APIDOC
## `cmudict.phones()`: Parsed phone table

Returns the 39 ARPAbet phones used in the dictionary as a list of `(phone, [category])` tuples where each phone maps to its phonetic category (e.g., vowel or consonant type).
### Usage Example

```python
import cmudict

phones = cmudict.phones()
print(len(phones))  # 39

for phone, categories in phones[:5]:
    print(phone, "->", categories)
# AA -> ['vowel']
# AE -> ['vowel']
# AH -> ['vowel']
# AO -> ['vowel']
# AW -> ['vowel']

# Get all vowel phones
vowels = [p for p, cats in phones if "vowel" in cats]
print("Vowels:", vowels)
# Vowels: ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER', 'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW']
```
```

--------------------------------

### Retrieve CMUdict License Text with cmudict.license_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the full text of the CMU Pronouncing Dictionary data license as a string. Essential for attribution and compliance in applications using the dictionary.

```python
import cmudict

license_text = cmudict.license_string()
print(len(license_text))  # 1754
print(license_text[:200])
# CMUdict
# -------
# Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
```

--------------------------------

### List All Phonetic Symbols with cmudict.symbols()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns a flat list of all 84 phonetic symbols, including ARPAbet phones and stress markers. Useful for validating pronunciations against the complete set of symbols.
```python
import cmudict

syms = cmudict.symbols()
print(len(syms))  # 84
print(syms[:10])
# ['AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0']

# Validate that a pronunciation uses only known symbols
def is_valid_pronunciation(phones: list[str]) -> bool:
    valid = set(cmudict.symbols())
    return all(p in valid for p in phones)

print(is_valid_pronunciation(["HH", "AH0", "L", "OW1"]))  # True
print(is_valid_pronunciation(["HH", "XX", "L", "OW1"]))   # False
```

--------------------------------

### cmudict.symbols_stream() / cmudict.symbols_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the cmudict.symbols file as a binary stream or as a string respectively.

```APIDOC
## `cmudict.symbols_stream()` / `cmudict.symbols_string()`: Raw symbols data

Returns the `cmudict.symbols` file as a binary stream or as a string respectively.

### Usage Example (String)

```python
import cmudict

s = cmudict.symbols_string()
print(len(s))  # 281
print(s[:40])
# AA
# AA0
# AA1
```

### Usage Example (Stream)

```python
import cmudict

with cmudict.symbols_stream() as stream:
    all_syms = [line.decode("utf-8").strip() for line in stream]
print(all_syms[:5])  # ['AA', 'AA0', 'AA1', 'AA2', 'AE']
```
```

--------------------------------

### cmudict.dict_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the full raw text of cmudict.dict as a string, identical to cmudict.raw().

```APIDOC
## cmudict.dict_string()

### Description

Returns the full raw text of `cmudict.dict` as a string. Identical to `cmudict.raw()`.

### Usage

```python
import cmudict

s = cmudict.dict_string()
print(len(s))  # 3618488
print(s[:80])
```
```

--------------------------------

### cmudict.entries()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the CMU lexicon as a flat list of (word, phones) tuples, preserving all entries including multiple pronunciations for the same word.
```APIDOC
## cmudict.entries()

### Description

Returns the CMU lexicon as a flat list of `(word, phones)` tuples, preserving all 135,166 entries including multiple pronunciations for the same word (each as a separate tuple). Compatible with NLTK's `CMUDictCorpusReader.entries()`.

### Usage

```python
import cmudict

entries = cmudict.entries()
print(len(entries))  # 135166

# First five entries
for word, phones in entries[:5]:
    print(word, "->", " ".join(phones))
# a -> AH0
# a(1) -> EY1
# a's -> EY1 Z
# a. -> EY1
# a.'s -> EY1 Z
```

### Example: Counting Multiple Pronunciations

```python
from collections import Counter

word_counts = Counter(word for word, _ in entries)
multi_pron = [w for w, c in word_counts.items() if c > 1]
print(f"{len(multi_pron)} words have multiple pronunciations")
# e.g. 9114 words have multiple pronunciations
```
```

--------------------------------

### cmudict.raw()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the complete cmudict.dict file as a single string, including comment lines. Useful for regex-based processing.

```APIDOC
## cmudict.raw()

### Description

Returns the complete `cmudict.dict` file as a single string (3,618,488 characters), including comment lines that begin with `;`. Useful for regex-based processing or passing the raw content to another tool. Compatible with NLTK's `CMUDictCorpusReader.raw()`.

### Usage

```python
import cmudict

raw = cmudict.raw()
print(type(raw))  # <class 'str'>
print(len(raw))   # 3618488
```

### Example: Printing First Three Non-Comment Lines

```python
lines = [l for l in raw.splitlines() if not l.startswith(";")]
for line in lines[:3]:
    print(line)
# A AH0
# A(1) EY1
# A'S EY1 Z
```
```

--------------------------------

### cmudict.words()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns a flat list of all lowercase word strings from the CMU lexicon, including duplicates for words with multiple pronunciations.
```APIDOC
## cmudict.words()

### Description

Returns a flat list of all 135,166 lowercase word strings from the CMU lexicon (one entry per pronunciation row, so words with multiple pronunciations appear more than once). Compatible with NLTK's `CMUDictCorpusReader.words()`.

### Usage

```python
import cmudict

words = cmudict.words()
print(len(words))  # 135166
print(words[:5])   # ['a', 'a(1)', "a's", 'a.', "a.'s"]
```

### Example: Checking Word Existence

```python
word_set = set(words)
print("hello" in word_set)   # True
print("foobar" in word_set)  # False
```
```

--------------------------------

### Access Raw Symbols Data with cmudict.symbols_stream() / cmudict.symbols_string()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Provides the cmudict.symbols file content as a binary stream or a string. Useful for direct processing of the symbol list.

```python
import cmudict

s = cmudict.symbols_string()
print(len(s))  # 281
print(s[:40])
# AA
# AA0
# AA1

with cmudict.symbols_stream() as stream:
    all_syms = [line.decode("utf-8").strip() for line in stream]
print(all_syms[:5])  # ['AA', 'AA0', 'AA1', 'AA2', 'AE']
```

--------------------------------

### cmudict.dict()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns the full CMU pronouncing dictionary as a dict. Keys are lowercase words, and values are lists of pronunciations (each pronunciation is a list of ARPAbet phone strings).

```APIDOC
## cmudict.dict()

### Description

Returns the full CMU pronouncing dictionary as a `dict` whose keys are lowercase words and whose values are lists of pronunciations (each pronunciation is a list of ARPAbet phone strings). Words with multiple valid pronunciations have multiple entries in the list. Compatible with NLTK's `CMUDictCorpusReader.dict()`.
### Usage

```python
import cmudict

d = cmudict.dict()

# Single-pronunciation word
print(d["hello"])
# [['HH', 'AH0', 'L', 'OW1']]

# Word with multiple pronunciations
print(d["spieth"])
# [['S', 'P', 'IY1', 'TH'], ['S', 'P', 'AY1', 'AH0', 'TH']]
```

### Example Function

```python
def get_pronunciation(word: str) -> list[list[str]]:
    pronunciations = d.get(word.lower())
    if pronunciations is None:
        raise KeyError(f"'{word}' not found in CMUdict ({len(d)} entries)")
    return pronunciations

print(get_pronunciation("Python"))
# [['P', 'AY1', 'TH', 'AH0', 'N']]
```
```

--------------------------------

### cmudict.dict_stream()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns an open binary file-like object for cmudict.dict, suitable for memory-efficient line-by-line processing.

```APIDOC
## cmudict.dict_stream()

### Description

Returns an open binary file-like object (`IO[bytes]`) for `cmudict.dict`. Useful for memory-efficient line-by-line processing of the full dictionary without loading it entirely into memory. The caller is responsible for closing the stream.

### Usage

```python
import cmudict

pronunciations = []
filehandle = cmudict.dict_stream()
for line in filehandle:
    decoded = line.strip().decode("utf-8")
    if decoded.startswith(";"):  # skip comment lines
        continue
    word, phones = decoded.split(" ", 1)
    pronunciations.append((word.split("(", 1)[0].lower(), phones))
filehandle.close()

print(len(pronunciations))  # 135166
print(pronunciations[0])    # ('a', 'AH0')
```
```

--------------------------------

### cmudict.symbols()

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Returns all 84 phonetic symbols (ARPAbet phones plus stress markers) used in the dictionary as a flat list of strings.

```APIDOC
## `cmudict.symbols()`: List of all phonetic symbols

Returns all 84 phonetic symbols (ARPAbet phones plus stress markers) used in the dictionary as a flat list of strings.
### Usage Example

```python
import cmudict

syms = cmudict.symbols()
print(len(syms))  # 84
print(syms[:10])
# ['AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0']

# Validate that a pronunciation uses only known symbols
def is_valid_pronunciation(phones: list[str]) -> bool:
    valid = set(cmudict.symbols())
    return all(p in valid for p in phones)

print(is_valid_pronunciation(["HH", "AH0", "L", "OW1"]))  # True
print(is_valid_pronunciation(["HH", "XX", "L", "OW1"]))   # False
```
```

--------------------------------

### Access CMUdict file as a binary stream

Source: https://context7.com/prosegrinder/python-cmudict/llms.txt

Obtain an open binary file-like object for cmudict.dict. Useful for memory-efficient line-by-line processing without loading the entire dictionary into memory. The caller is responsible for closing the stream.

```python
import cmudict

pronunciations = []
filehandle = cmudict.dict_stream()
for line in filehandle:
    decoded = line.strip().decode("utf-8")
    if decoded.startswith(";"):  # skip comment lines
        continue
    word, phones = decoded.split(" ", 1)
    pronunciations.append((word.split("(", 1)[0].lower(), phones))
filehandle.close()

print(len(pronunciations))  # 135166
print(pronunciations[0])    # ('a', 'AH0')
```

--------------------------------

### NLTK Compatibility Functions

Source: https://github.com/prosegrinder/python-cmudict/blob/main/README.md

Utilize functions that maintain compatibility with NLTK's corpus reader for CMUdict. These include accessing entries, raw data, and words.

```python
import cmudict

cmudict.entries()  # Compatible with NLTK
cmudict.raw()      # Compatible with NLTK
cmudict.words()    # Compatible with NLTK
```
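--------------------------------

### Example: Counting syllables from pronunciations

The snippets above show that `cmudict.dict()` maps each word to a list of pronunciations and that vowel symbols carry a stress digit (`AH0`, `OW1`, `OY2`). The following is a minimal sketch of how that structure could be used to count syllables. The helper names `count_syllables` and `syllable_counts` and the `SAMPLE` lexicon are hypothetical and not part of the cmudict package; `SAMPLE` only mimics the shape of `cmudict.dict()` using entries copied from the examples above, so the block runs without the package installed.

```python
# Hypothetical helpers; not part of the cmudict package.

def count_syllables(phones: list[str]) -> int:
    """Count phones that end in a stress digit (0, 1 or 2), i.e. the vowels."""
    return sum(1 for p in phones if p[-1].isdigit())


def syllable_counts(word: str, lexicon: dict[str, list[list[str]]]) -> list[int]:
    """Return one syllable count per listed pronunciation, or [] if the word is unknown."""
    return [count_syllables(p) for p in lexicon.get(word.lower(), [])]


# Stand-in lexicon with the same shape as cmudict.dict() (assumption: illustrative data only)
SAMPLE = {
    "hello": [["HH", "AH0", "L", "OW1"]],
    "spieth": [["S", "P", "IY1", "TH"], ["S", "P", "AY1", "AH0", "TH"]],
}

print(syllable_counts("Hello", SAMPLE))   # [2]
print(syllable_counts("spieth", SAMPLE))  # [1, 2]
print(syllable_counts("foobar", SAMPLE))  # []
```

With the package installed, passing `cmudict.dict()` in place of `SAMPLE` should work the same way, since the lexicon is taken as a parameter rather than loaded inside the helper. Returning a list keeps words with several pronunciations visible instead of silently picking one.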