### Install WordSegment

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Install the WordSegment library using pip. This is the first step to using the library in your Python projects.

```bash
$ pip install wordsegment
```

--------------------------------

### Install wordsegment with pip

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Install the wordsegment library using pip. This is the recommended method for adding the library to your project.

```bash
pip install wordsegment
```

--------------------------------

### Run wordsegment as a Server Process

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Execute the wordsegment module with unbuffered output for use in a server-like process. This example demonstrates writing to stdin and reading from stdout.

```python
import subprocess as sp
wordsegment = sp.Popen(
    ['python', '-um', 'wordsegment'],
    stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT)
wordsegment.stdin.write('thisisatest\n')
wordsegment.stdout.readline()
wordsegment.stdin.write('workswithotherlanguages\n')
wordsegment.stdout.readline()
wordsegment.stdin.close()
wordsegment.wait()  # Process exit code.
```

--------------------------------

### Server Process with Unbuffered Output

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Run WordSegment as a server process using Python's -u option for unbuffered output. This example demonstrates piping input and reading output from the subprocess.

```python
>>> import subprocess as sp
>>> wordsegment = sp.Popen(
    ['python', '-um', 'wordsegment'],
    stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT)
>>> wordsegment.stdin.write('thisisatest\n')
>>> wordsegment.stdout.readline()
'this is a test\n'
>>> wordsegment.stdin.write('workswithotherlanguages\n')
>>> wordsegment.stdout.readline()
'works with other languages\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.
0
```

--------------------------------

### Get CPU and Python Version

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Retrieves and prints the CPU brand string (via `sysctl`) and the Python version, giving context for the benchmark timings. The snippet uses Python 2 print statements.

```python
import subprocess
print subprocess.check_output([
    '/usr/sbin/sysctl', '-n', 'machdep.cpu.brand_string'
])

import sys
print sys.version
```

--------------------------------

### Access Specific Bigram Counts

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Retrieve the count for specific bigrams from the BIGRAMS dictionary. The '<s>' prefix indicates the start of a bigram.

```python
>>> ws.BIGRAMS['<s> where']
15419048.0
>>> ws.BIGRAMS['<s> what']
11779290.0
```

--------------------------------

### View Sample Dictionary Entries

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Displays the first few entries from the UNIGRAMS and BIGRAMS dictionaries to show their structure (word/phrase and count).

```python
print wordsegment.UNIGRAMS.items()[:3]
print wordsegment.BIGRAMS.items()[:2]
```

--------------------------------

### Import izip for Dictionary Creation

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Imports the izip function from itertools (Python 2 only), aliased as zip, so that keys and values are paired lazily instead of through an intermediate list.

```python
from itertools import izip as zip
```

--------------------------------

### Load Dictionary from Text and Binary Files

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Reads words from a text file and counts from a binary file, then constructs a dictionary. This method is optimized for speed by using str.split for words and the array module for binary counts.
```python
from array import array

with open('words.txt', 'rb') as lines, open('counts.bin', 'rb') as counts:
    words = lines.read().split('\n')
    values = array('d')
    values.fromfile(counts, 333333)
    dict(zip(words, values))
```

--------------------------------

### Construct Dictionary using `__setitem__`

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Benchmarks the time taken to construct a dictionary by assigning each item individually, which calls `__setitem__` once per entry.

```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    result = dict()
    for word, number in lines:
        result[word] = float(number)
```

--------------------------------

### Access Documentation in Interpreter

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Access the built-in help documentation for the wordsegment library directly within the Python interpreter.

```python
>>> import wordsegment
>>> help(wordsegment)
```

--------------------------------

### Load WordSegment

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Loads the WordSegment data. Call this before accessing or modifying the library's dictionaries.

```python
import wordsegment
wordsegment.load()
```

--------------------------------

### wordsegment.load()

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Loads unigram and bigram counts from disk.

```APIDOC
## wordsegment.load()

### Description
Load unigram and bigram counts from disk.
### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
None

### Response Example
None
```

--------------------------------

### Fast Dictionary Loading from Files in Python

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to load a dictionary from two files: 'words.txt' for keys and 'counts.bin' for values. Uses `str.split` for words and the `array` module for binary counts. Best used when optimizing for read performance from disk.

```python
from array import array

%%timeit
with open('words.txt', 'rb') as lines, open('counts.bin', 'rb') as counts:
    words = lines.read().split('\n')
    values = array('d')
    values.fromfile(counts, 333333)
    dict(zip(words, values))
```

--------------------------------

### Time Dictionary Construction from File

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to read a file, split lines, and construct a dictionary. This serves as a baseline for performance comparisons.

```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    dict((word, float(number)) for word, number in lines)
```

--------------------------------

### Read All Lines from File

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Reads all lines from the unigrams.txt file into a list. This is a benchmark for file reading performance.
```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = [line for line in reader]
```

--------------------------------

### Construct Dictionary using Dict Comprehension

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Benchmarks the time taken to construct a dictionary using a dictionary comprehension. Dict comprehensions require Python 2.7 or later.

```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    {word: float(number) for word, number in lines}
```

--------------------------------

### Segment Text using Command Line Interface

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Use the wordsegment module from the command line to segment text piped from stdin to stdout. Input and output can be redirected to files.

```bash
echo thisisatest | python -m wordsegment
```

--------------------------------

### Command-Line Batch Processing

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Use the WordSegment command-line interface for batch processing text files. Input is read from stdin and output is written to stdout by default.

```bash
$ echo thisisatest | python -m wordsegment
this is a test
```

--------------------------------

### Import WordSegment Library

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Imports the wordsegment library so that its components can be modified for use with a custom corpus.

```python
import wordsegment
```

--------------------------------

### Convert Strings to Floats

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Benchmarks the time taken to convert the count strings to floating-point numbers.
```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        float(number)
```

--------------------------------

### Inspect Default Dictionaries

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Prints the types of the UNIGRAMS and BIGRAMS dictionaries to confirm they are loaded correctly.

```python
print type(wordsegment.UNIGRAMS), type(wordsegment.BIGRAMS)
```

--------------------------------

### Load and Segment Text in Python

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Load the wordsegment data and segment a given phrase into a list of words. The `load` function should be called once before segmentation.

```python
from wordsegment import load, segment
load()
segment('thisisatest')
```

--------------------------------

### Read First Line of Unigram File

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Reads and prints the first line from the unigrams.txt file to show its format.

```python
with open('../wordsegment_data/unigrams.txt', 'r') as reader:
    print repr(reader.readline())
```

--------------------------------

### Split Lines by Tab Character

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Benchmarks the time taken to split each line in the unigrams.txt file by the tab character.

```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        pass
```

--------------------------------

### wordsegment.UNIGRAMS

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

A mapping of unigram counts, loaded from 'wordsegment/unigrams.txt'.

```APIDOC
## wordsegment.UNIGRAMS

### Description
Mapping of (unigram, count) pairs.
Loaded from the file 'wordsegment/unigrams.txt'.

### Parameters
None

### Request Example
None

### Response
#### Success Response (200)
- **UNIGRAMS** (dict) - A dictionary where keys are unigrams and values are their counts.

### Response Example
None
```

--------------------------------

### Read File and Split by Newline

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Reads the entire file content and splits it into lines based on the newline character. This is a benchmark for reading and splitting performance.

```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = reader.read().split('\n')
```

--------------------------------

### Convert Unigrams to Binary Format

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Converts the unigrams.txt file into two separate files: 'words.txt' (ASCII words) and 'counts.bin' (binary counts) for potentially faster loading.

```python
with open('../wordsegment_data/unigrams.txt') as reader:
    pairs = [line.split('\t') for line in reader]
    words = [pair[0] for pair in pairs]
    counts = [float(pair[1]) for pair in pairs]

with open('words.txt', 'wb') as writer:
    writer.write('\n'.join(words))

from array import array

values = array('d')
values.fromlist(counts)
with open('counts.bin', 'wb') as writer:
    values.tofile(writer)
```

--------------------------------

### Explore Bigram Counts in Python

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Load the wordsegment data and explore the bigram counts using `heapq.nlargest`. Bigrams are phrases of two words joined by a space.
```python
import heapq
from pprint import pprint
from operator import itemgetter
import wordsegment as ws

ws.load()
pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
```

--------------------------------

### Time Converting Strings to Floats

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to convert the number strings obtained after splitting lines into floating-point numbers. This highlights another performance bottleneck.

```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        float(number)
```

--------------------------------

### Time Reading File and Splitting Lines

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to read the entire file content and then split it into lines. This method is often faster than reading line by line.

```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = reader.read().split('\n')
```

--------------------------------

### Time Reading All Lines from File

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to read all lines from the unigrams.txt file into a list. This isolates the cost of iterating over the file, before any splitting or conversion.

```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = [line for line in reader]
```

--------------------------------

### Explore Bigram Counts

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Examine the bigram counts using heapq.nlargest to find the most frequent bigrams. Bigrams are represented as strings with words joined by a space.
```python
>>> import heapq
>>> from pprint import pprint
>>> from operator import itemgetter
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[('of the', 2766332391.0),
 ('in the', 1628795324.0),
 ('to the', 1139248999.0),
 ('on the', 800328815.0),
 ('for the', 692874802.0),
 ('and the', 629726893.0),
 ('to be', 505148997.0),
 ('is a', 476718990.0),
 ('with the', 461331348.0),
 ('from the', 428303219.0)]
```

--------------------------------

### Inspect Default Counts

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Displays the types and initial entries of the unigram and bigram count dictionaries within the wordsegment library. This helps understand the structure before modification.

```python
print type(wordsegment.unigram_counts), type(wordsegment.bigram_counts)
```

```python
print wordsegment.unigram_counts.items()[:3]
print wordsegment.bigram_counts.items()[:3]
```

--------------------------------

### Access Unigram Counts in Python

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Load the wordsegment data and access the unigram counts dictionary. This allows exploration of word frequencies.

```python
import wordsegment as ws
ws.load()
ws.UNIGRAMS['the']
ws.UNIGRAMS['gray']
ws.UNIGRAMS['grey']
```

--------------------------------

### Load Unigrams into Dictionary

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.md

Loads unigram data from a file into a Python dictionary, mapping words to their counts. This is the standard method used by the wordsegment module.
```python
# %%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    dict((word, float(number)) for word, number in lines)
```

--------------------------------

### Download Corpus Text

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Fetches the text content of a Gutenberg ebook using the requests library. This is the first step in preparing a custom corpus for word segmentation.

```python
import requests
response = requests.get('https://www.gutenberg.org/ebooks/1342.txt.utf-8')
text = response.text
print len(text)
```

--------------------------------

### Load and Segment Text

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Load the necessary data and segment a given phrase into a list of its constituent parts. The load function should be called once before segmenting text.

```python
>>> from wordsegment import load, segment
>>> load()
>>> segment('thisisatest')
['this', 'is', 'a', 'test']
```

--------------------------------

### Explore Unigram Counts

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Access and inspect the unigram counts stored in the UNIGRAMS dictionary. This allows exploration of word frequencies.

```python
>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS['the']
23135851162.0
>>> ws.UNIGRAMS['gray']
21424658.0
>>> ws.UNIGRAMS['grey']
18276942.0
```

--------------------------------

### Time Splitting Lines by Tab

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken specifically for splitting each line of the file by the tab character. This isolates a performance bottleneck.
```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    for word, number in lines:
        pass
```

--------------------------------

### wordsegment.divide(text)

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Generates (prefix, suffix) pairs from the input text, where the length of the prefix does not exceed a specified limit.

```APIDOC
## wordsegment.divide(text)

### Description
Yield (prefix, suffix) pairs from text with len(prefix) not exceeding limit.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **pairs** (iterator) - An iterator yielding (prefix, suffix) pairs.

### Response Example
None
```

--------------------------------

### wordsegment.BIGRAMS

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

A mapping of bigram counts, where bigram keys are joined by a space, loaded from 'wordsegment/bigrams.txt'.

```APIDOC
## wordsegment.BIGRAMS

### Description
Mapping of (bigram, count) pairs. Bigram keys are joined by a space. Loaded from the file 'wordsegment/bigrams.txt'.

### Parameters
None

### Request Example
None

### Response
#### Success Response (200)
- **BIGRAMS** (dict) - A dictionary where keys are bigrams (space-joined) and values are their counts.

### Response Example
None
```

--------------------------------

### wordsegment.segment(text)

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Returns a list of words representing the best segmentation of the input text.

```APIDOC
## wordsegment.segment(text)

### Description
Return a list of words that is the best segmentation of text.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **words** (list) - A list of the segmented words.
### Response Example
None
```

--------------------------------

### Time Dictionary Construction via Iteration

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/python-load-dict-fast-from-file.ipynb

Measures the time taken to construct a dictionary by iterating through lines, splitting them, converting to float, and assigning to dictionary keys. This approach avoids tuple creation.

```python
%%timeit
with open('../wordsegment_data/unigrams.txt') as reader:
    lines = (line.split('\t') for line in reader)
    result = dict()
    for word, number in lines:
        result[word] = float(number)
```

--------------------------------

### Build Custom Count Dictionaries

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Updates the wordsegment library's unigram and bigram count dictionaries using tokenized text from a custom corpus. It also defines a helper function `pairs` to generate bigrams.

```python
from collections import Counter

wordsegment.unigram_counts = Counter(tokenize(text))

def pairs(iterable):
    iterator = iter(iterable)
    values = [next(iterator)]
    for value in iterator:
        values.append(value)
        yield ' '.join(values)
        del values[0]

wordsegment.bigram_counts = Counter(pairs(tokenize(text)))
```

--------------------------------

### wordsegment.score(word, prev=None)

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Scores a word based on its context, optionally considering the previous word.

```APIDOC
## wordsegment.score(word, prev=None)

### Description
Score a word in the context of the previous word, prev.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **score** (float) - The score of the word.
### Response Example
None
```

--------------------------------

### Clean and Segment Text in Python

Source: https://github.com/grantjenks/python-wordsegment/blob/master/README.rst

Clean input text to a canonical form before segmenting. The `clean` function removes punctuation and lowercases text.

```python
from wordsegment import clean, load, segment
load()
clean('She said, "Python rocks!"')
segment('She said, "Python rocks!"')
```

--------------------------------

### Clean and Segment Text

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/index.md

Use the clean function to transform input text into a canonical form before segmentation. This removes punctuation and converts text to lowercase.

```python
>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
```

--------------------------------

### Tokenize Text

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Defines a function that tokenizes input text into words using a regular expression that matches runs of ASCII letters. This utility prepares text for building the count dictionaries.

```python
import re

def tokenize(text):
    pattern = re.compile('[a-zA-Z]+')
    return (match.group(0) for match in pattern.finditer(text))

print list(tokenize("Wait, what did you say?"))
```

--------------------------------

### wordsegment.clean(text)

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Cleans the input text by converting it to lower case and removing non-alphanumeric characters.

```APIDOC
## wordsegment.clean(text)

### Description
Return text lower-cased with non-alphanumeric characters removed.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **text** (string) - The cleaned text.
### Response Example
None
```

--------------------------------

### Segment Text with Custom Corpus

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Performs word segmentation on a given string using the updated wordsegment library, which now uses the custom corpus and the modified cleaning function.

```python
wordsegment.segment('wantofawife')
```

--------------------------------

### Replace Cleaning Function

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Replaces the default `clean` function on the `_segmenter` object with an identity function. This disables input sanitization, allowing segmentation of text with capitals and punctuation.

```python
from wordsegment import _segmenter

def identity(value):
    return value

_segmenter.clean = identity
```

--------------------------------

### wordsegment.isegment(text)

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/api.md

Returns an iterator of words representing the best segmentation of the input text.

```APIDOC
## wordsegment.isegment(text)

### Description
Return iterator of words that is the best segmentation of text.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **words** (iterator) - An iterator yielding the segmented words.

### Response Example
None
```

--------------------------------

### Update WordSegment Dictionaries

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Clears the existing UNIGRAMS and BIGRAMS dictionaries and populates them with counts derived from the custom corpus text. Requires the `tokenize` function and the corpus `text`; the `pairs` helper is defined in the snippet itself.
```python
from collections import Counter

wordsegment.UNIGRAMS.clear()
wordsegment.UNIGRAMS.update(Counter(tokenize(text)))

def pairs(iterable):
    iterator = iter(iterable)
    values = [next(iterator)]
    for value in iterator:
        values.append(value)
        yield ' '.join(values)
        del values[0]

wordsegment.BIGRAMS.clear()
wordsegment.BIGRAMS.update(Counter(pairs(tokenize(text))))
```

--------------------------------

### Update Total Unigram Count

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.md

Sets the `_segmenter.total` variable to the sum of all unigram counts from the custom corpus. This is necessary for accurate probability calculations during segmentation.

```python
_segmenter.total = float(sum(wordsegment.UNIGRAMS.values()))
```

--------------------------------

### Replace Cleaning Function

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Replaces the default text cleaning function in wordsegment with an identity function. This prevents any pre-processing of input strings before segmentation, useful when the corpus has specific formatting.

```python
def identity(value):
    return value

wordsegment.clean = identity
```

--------------------------------

### Update Total Count

Source: https://github.com/grantjenks/python-wordsegment/blob/master/docs/using-a-different-corpus.ipynb

Adjusts the `wordsegment.TOTAL` variable to reflect the sum of all unigram counts from the custom corpus. This is important for accurate segmentation probabilities when using a new corpus.

```python
wordsegment.TOTAL = float(sum(wordsegment.unigram_counts.values()))
```
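
--------------------------------

### Sketch: Count Unigrams and Bigrams with the Standard Library

The custom-corpus snippets above rely on `text`, `tokenize`, and `pairs` being defined in earlier notebook cells, and they use Python 2 syntax. The following is a minimal, self-contained sketch of the same counting step in Python 3, not the library's own code. The sample `text` is a hypothetical stand-in for the downloaded corpus, the names `unigrams`, `bigrams`, and `total` are illustrative, and `pairs` is rewritten here (an assumption, not the source's version) so that an empty input yields no bigrams instead of raising an error.

```python
import re
from collections import Counter


def tokenize(text):
    """Yield runs of ASCII letters, mirroring the tokenizer shown above."""
    return (match.group(0) for match in re.finditer(r'[a-zA-Z]+', text))


def pairs(iterable):
    """Yield space-joined bigrams from consecutive items."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # Empty input: there are no bigrams to yield.
    for current in iterator:
        yield previous + ' ' + current
        previous = current


# Hypothetical stand-in corpus; the notebook downloads a Gutenberg text instead.
text = 'in want of a wife in want of a fortune'

unigrams = Counter(tokenize(text))
bigrams = Counter(pairs(tokenize(text)))
total = float(sum(unigrams.values()))

print(unigrams['want'], bigrams['want of'], total)  # prints: 2 2 10.0
```

Following the approach in using-a-different-corpus.md, counts like these would then be copied into `wordsegment.UNIGRAMS` and `wordsegment.BIGRAMS`, with the sum assigned to `_segmenter.total`.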