### Initialize and Use THULAC for Segmentation

Source: https://github.com/thunlp/thulac-python/blob/master/README.md

Demonstrates how to initialize the THULAC analyzer and perform word segmentation on a string or a file. The first example returns tagged text, while the second performs segmentation-only processing on a file.

```python
import thulac

# Default mode: segmentation and POS tagging
thu1 = thulac.thulac()
text = thu1.cut("我爱北京天安门", text=True)
print(text)

# Segmentation only mode
thu1 = thulac.thulac(seg_only=True)
thu1.cut_f("input.txt", "output.txt")
```

--------------------------------

### Command-Line Interface (run())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Starts an interactive command-line segmentation program that reads from standard input and outputs results in real-time.

```APIDOC
## run() - Command-Line Interactive Mode

### Description
Launches an interactive command-line segmentation program. It reads text from standard input and outputs the segmentation results immediately. The program exits when an empty line is entered or upon receiving an EOF signal.

### Method
`thulac_instance.run()`

### Request Example
```python
import thulac

# Initialize THULAC
thu = thulac.thulac()

# Start interactive mode
thu.run()

# Interactive Example:
# Input: 我爱北京天安门
# Output: 我_r 爱_v 北京_ns 天安门_ns
# Input: (Empty line or Ctrl+D to exit)
```
```

--------------------------------

### Part-of-Speech (POS) Tagging Explanation (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Provides a detailed explanation of the part-of-speech tags used by THULAC. It includes a Python code example demonstrating how to get POS tags for a sentence and a comprehensive list of all tag labels and their meanings.
```python
import thulac

thu = thulac.thulac()

# POS tagging example
text = "孙茂松教授在清华大学讲授自然语言处理课程"
result = thu.cut(text)
print(result)

# Example output:
# [['孙茂松', 'np'],    # np: Person name
#  ['教授', 'n'],       # n: Noun
#  ['在', 'p'],         # p: Preposition
#  ['清华大学', 'ni'],  # ni: Organization name
#  ['讲授', 'v'],       # v: Verb
#  ['自然', 'n'],       # n: Noun
#  ['语言', 'n'],       # n: Noun
#  ['处理', 'v'],       # v: Verb
#  ['课程', 'n']]       # n: Noun

# Full list of POS tags:
# n/Noun  np/Person name  ns/Location name  ni/Organization name  nz/Other proper noun
# m/Numeral  q/Measure word  mq/Quantity  t/Time  f/Locative  s/Place
# v/Verb  a/Adjective  d/Adverb  h/Pre-component  k/Post-component
# i/Idiom  j/Abbreviation  r/Pronoun  c/Conjunction  p/Preposition  u/Auxiliary  y/Modal particle
# e/Interjection  o/Onomatopoeia  g/Morpheme  w/Punctuation  x/Other
# uw/User dictionary word (tagged when using user_dict)
```

--------------------------------

### File Segmentation with cut_f() (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Demonstrates how to segment an entire file using the `cut_f()` method, reading from an input file and writing results to an output file. This is suitable for processing large amounts of text data. Examples cover default mode and segmentation-only mode.

```python
import thulac

# Initialize segmenter
thu = thulac.thulac()

# Segment a file (segmentation + POS tagging)
thu.cut_f("input.txt", "output.txt")
# Input file input.txt content: "我爱北京天安门"
# Output file output.txt content: "我_r 爱_v 北京_ns 天安门_ns"

# Segmentation-only mode file processing
thu_seg = thulac.thulac(seg_only=True)
thu_seg.cut_f("input.txt", "output_seg.txt")
# Output file output_seg.txt content: "我 爱 北京 天安门"

# A success message is printed after processing
# "successfully cut file input.txt!"
```

--------------------------------

### Command-Line Interface (CLI) Usage (Bash)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Illustrates how to use THULAC directly from the command line for file segmentation.
This is convenient for users who have installed the package via pip. It supports specifying input/output files and segmentation modes.

```bash
# Basic usage: segmentation + POS tagging
python -m thulac input.txt output.txt

# Segmentation-only mode (no POS tagging)
python -m thulac input.txt output.txt -seg_only

# Example input file input.txt:
# 清华大学是中国著名的高等学府

# Output file output.txt (default mode):
# 清华大学_ni 是_v 中国_ns 著名_a 的_u 高等_a 学府_n

# Output file output.txt (segmentation-only mode):
# 清华大学 是 中国 著名 的 高等 学府
```

--------------------------------

### Single Sentence Segmentation with cut() (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Shows how to perform word segmentation on a single sentence using the `cut()` method. It supports returning results as a list of word-tag pairs or as a space-separated string. Examples include default mode, segmentation-only mode, and handling multi-line text.

```python
import thulac

# Initialize segmenter (segmentation + POS tagging mode)
thu = thulac.thulac()

# Return as a list of [word, tag] pairs (default)
result = thu.cut("我爱北京天安门")
print(result)
# Output: [['我', 'r'], ['爱', 'v'], ['北京', 'ns'], ['天安门', 'ns']]

# Return as a space-separated text string
result_text = thu.cut("我爱北京天安门", text=True)
print(result_text)
# Output: "我_r 爱_v 北京_ns 天安门_ns"

# Segmentation-only mode
thu_seg = thulac.thulac(seg_only=True)
result_seg = thu_seg.cut("我爱北京天安门")
print(result_seg)
# Output: [['我', ''], ['爱', ''], ['北京', ''], ['天安门', '']]

result_seg_text = thu_seg.cut("我爱北京天安门", text=True)
print(result_seg_text)
# Output: "我 爱 北京 天安门"

# Processing multi-line text
multi_line = "今天天气真好\n我们去公园玩吧"
result_multi = thu.cut(multi_line, text=True)
print(result_multi)
# Output:
# 今天_t 天气_n 真_d 好_a
# 我们_r 去_v 公园_n 玩_v 吧_y
```

--------------------------------

### User Custom Dictionary Usage (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Explains how to enhance THULAC's segmentation accuracy by using a custom user
dictionary. The dictionary should contain one word per line in UTF-8 encoding. The example shows initializing THULAC with a user dictionary and comparing its output to segmentation without one.

```python
# Create user dictionary file user_dict.txt:
# 深度学习
# 自然语言处理
# 机器翻译
# ChatGPT

import thulac

# Initialize with user dictionary
thu = thulac.thulac(user_dict="user_dict.txt")

# Segmentation result without dictionary
thu_no_dict = thulac.thulac()
result1 = thu_no_dict.cut("深度学习是人工智能的核心技术", text=True)
print(result1)
# Illustrative output (may vary based on THULAC version and model):
# 深度_n 学习_v 是_v 人工_n 智能_n 的_u 核心_n 技术_n

# Segmentation result with dictionary
result2 = thu.cut("深度学习是人工智能的核心技术", text=True)
print(result2)
# Illustrative output (showing '深度学习' as a single unit tagged 'uw'):
# 深度学习_uw 是_v 人工_n 智能_n 的_u 核心_n 技术_n
```

--------------------------------

### Command-Line Invocation

Source: https://context7.com/thunlp/thulac-python/llms.txt

Directly invokes THULAC via the command line for file segmentation, suitable for users who have installed the package via pip.

```APIDOC
## Command-Line Invocation

### Description
THULAC can be invoked directly from the command line to process files. This is convenient for users who have installed the package using pip. You can specify input and output files, as well as segmentation modes.

### Usage
`python -m thulac input_file output_file [options]`

### Options
- `-seg_only`: Perform segmentation only, without part-of-speech tagging.
### Examples

**Basic Usage (Segmentation + POS Tagging):**
```bash
python -m thulac input.txt output.txt
```

**Segmentation Only Mode:**
```bash
python -m thulac input.txt output.txt -seg_only
```

**Example Input File (`input.txt`):**
```
清华大学是中国著名的高等学府
```

**Example Output File (`output.txt`) (Default Mode):**
```
清华大学_ni 是_v 中国_ns 著名_a 的_u 高等_a 学府_n
```

**Example Output File (`output.txt`) (Segmentation Only Mode):**
```
清华大学 是 中国 著名 的 高等 学府
```
```

--------------------------------

### Initialize THULAC Analyzer (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Demonstrates how to initialize the THULAC analyzer with various configuration options. This includes setting segmentation-only mode, enabling Traditional to Simplified Chinese conversion, using custom dictionaries, applying filters, customizing delimiters, specifying model paths, and removing spaces.

```python
import thulac

# Default mode: segmentation + POS tagging
thu_default = thulac.thulac()

# Segmentation-only mode (faster, no POS tagging)
thu_seg = thulac.thulac(seg_only=True)

# Traditional to Simplified Chinese conversion + segmentation
thu_t2s = thulac.thulac(T2S=True)

# Using a custom user dictionary (each word per line, UTF-8 encoded)
thu_dict = thulac.thulac(user_dict="user_dict.txt")

# Enable filter to remove meaningless words
thu_filt = thulac.thulac(filt=True)

# Custom delimiter between word and tag (default is '_')
thu_custom = thulac.thulac(deli='/')

# Specify model file path
thu_model = thulac.thulac(model_path="/path/to/models/")

# Remove spaces in original text before segmentation
thu_rmspace = thulac.thulac(rm_space=True)

# Full parameter example
thu_full = thulac.thulac(
    user_dict="custom_dict.txt",  # Path to user dictionary
    model_path="models/",         # Path to model folder
    T2S=False,                    # Traditional to Simplified Chinese
    seg_only=False,               # Segmentation-only mode
    filt=False,                   # Filter meaningless words
    deli='_',                     # Delimiter between word and tag
    rm_space=False                # Remove spaces
)
```

--------------------------------

### Command Line Usage

Source: https://github.com/thunlp/thulac-python/blob/master/README.md

Shows how to run the THULAC tool directly from the command line for batch processing of text files.

```bash
python -m thulac input.txt output.txt

# Run with segmentation only
python -m thulac input.txt output.txt seg_only
```

--------------------------------

### THULAC Initialization (thulac())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Initializes the THULAC analyzer with various configuration options for segmentation and tagging.

```APIDOC
## thulac() - Initialize THULAC Analyzer

### Description
Creates an instance of the THULAC analyzer. Various parameters can be set to customize segmentation behavior, including user dictionaries, simplified/traditional Chinese conversion, segmentation-only mode, and word filtering. The model files are loaded automatically during initialization.

### Method
`thulac.thulac()`

### Parameters
- **user_dict** (string) - Optional - Path to a custom user dictionary file (UTF-8 encoded, one word per line).
- **model_path** (string) - Optional - Path to the directory containing model files.
- **T2S** (boolean) - Optional - If True, converts traditional Chinese characters to simplified Chinese.
- **seg_only** (boolean) - Optional - If True, performs segmentation only without part-of-speech tagging (faster).
- **filt** (boolean) - Optional - If True, filters out meaningless words.
- **deli** (string) - Optional - Delimiter character between words and their tags (default is '_').
- **rm_space** (boolean) - Optional - If True, removes spaces from the original text before segmentation.
### Request Example
```python
import thulac

# Default mode: segmentation + POS tagging
thu_default = thulac.thulac()

# Segmentation only mode (faster)
thu_seg = thulac.thulac(seg_only=True)

# Traditional to Simplified Chinese conversion + segmentation
thu_t2s = thulac.thulac(T2S=True)

# Using a custom user dictionary
thu_dict = thulac.thulac(user_dict="user_dict.txt")

# Enabling filter to remove meaningless words
thu_filt = thulac.thulac(filt=True)

# Custom delimiter between word and tag (default is '_')
thu_custom = thulac.thulac(deli='/')

# Specifying model file path
thu_model = thulac.thulac(model_path="/path/to/models/")

# Remove spaces before segmentation
thu_rmspace = thulac.thulac(rm_space=True)

# Full parameter example
thu_full = thulac.thulac(
    user_dict="custom_dict.txt",
    model_path="models/",
    T2S=False,
    seg_only=False,
    filt=False,
    deli='_',
    rm_space=False
)
```
```

--------------------------------

### Run Interactive Mode (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Demonstrates how to launch THULAC in an interactive command-line mode using the `run()` method. This allows users to input text directly and receive immediate segmentation results. The program exits on an empty line or EOF.

```python
import thulac

# Start interactive mode
thu = thulac.thulac()
thu.run()

# Interactive example:
# Input: 我爱北京天安门
# Output: 我_r 爱_v 北京_ns 天安门_ns
# Input: (Empty line or Ctrl+D to exit)
```

--------------------------------

### Fast File Segmentation with fast_cut_f() (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Presents the `fast_cut_f()` method for high-speed file segmentation using the C extension. This method is optimized for large-scale text file processing and offers better performance than the standard `cut_f()`. It also requires the pre-compiled `libthulac.so`.
```python
import thulac

# Initialize segmenter
thu = thulac.thulac()

# Fast file segmentation
thu.fast_cut_f("input.txt", "output.txt")
# Prints: "successfully cut file input.txt!"

# Segmentation-only mode
thu_seg = thulac.thulac(seg_only=True)
thu_seg.fast_cut_f("large_input.txt", "large_output.txt")
```

--------------------------------

### Fast Single Sentence Segmentation with fast_cut() (Python)

Source: https://context7.com/thunlp/thulac-python/llms.txt

Introduces the `fast_cut()` method, which utilizes a C extension for significantly faster sentence segmentation compared to `cut()`. It requires pre-compiled `libthulac.so`. The interface is identical to `cut()`, supporting both list and text output formats.

```python
import thulac

# Initialize segmenter
thu = thulac.thulac()

# Fast segmentation (returns list of pairs)
result = thu.fast_cut("我爱北京天安门")
print(result)
# Output format is the same as cut()

# Fast segmentation (returns text)
result_text = thu.fast_cut("我爱北京天安门", text=True)
print(result_text)
# Output: "我_r 爱_v 北京_ns 天安门_ns"

# Fast segmentation in segmentation-only mode
thu_seg = thulac.thulac(seg_only=True)
result_seg = thu_seg.fast_cut("我爱北京天安门", text=True)
print(result_seg)
# Output: "我 爱 北京 天安门"
```

--------------------------------

### Fast File Segmentation (fast_cut_f())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Provides a faster file segmentation interface using C extensions, suitable for large-scale text processing.

```APIDOC
## fast_cut_f() - Fast File Segmentation

### Description
This is a high-speed file segmentation interface implemented using C extensions. It requires the `libthulac.so` file to be pre-compiled and is suitable for processing large text files, offering better performance than the standard `cut_f()`.

### Method
`thulac_instance.fast_cut_f(input_filepath, output_filepath)`

### Parameters
- **input_filepath** (string) - The path to the input text file.
- **output_filepath** (string) - The path to the output file where segmented results will be written.

### Request Example
```python
import thulac

# Initialize THULAC
thu = thulac.thulac()

# Fast file segmentation
thu.fast_cut_f("input.txt", "output.txt")

# A success message will be printed upon completion
# "successfully cut file input.txt!"

# Segmentation only mode
thu_seg = thulac.thulac(seg_only=True)
thu_seg.fast_cut_f("large_input.txt", "large_output.txt")
```
```

--------------------------------

### Fast Single Sentence Segmentation (fast_cut())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Provides a faster single sentence segmentation interface using C extensions.

```APIDOC
## fast_cut() - Fast Single Sentence Segmentation

### Description
This is a high-speed segmentation interface implemented using C extensions, offering faster performance than the standard `cut()` method. It requires the `libthulac.so` file to be pre-compiled and placed in the same directory as the models. The interface parameters are identical to `cut()`.

### Method
`thulac_instance.fast_cut(sentence, text=False)`

### Parameters
- **sentence** (string) - The input sentence to segment.
- **text** (boolean) - Optional - If True, returns the result as a space-separated string. Defaults to False (returns `[[word, tag], ...]`).
### Request Example
```python
import thulac

# Initialize THULAC
thu = thulac.thulac()

# Fast segmentation (returns list of pairs)
result = thu.fast_cut("我爱北京天安门")
print(result)
# Output format is the same as cut()

# Fast segmentation (returns string)
result_text = thu.fast_cut("我爱北京天安门", text=True)
print(result_text)
# Output: "我_r 爱_v 北京_ns 天安门_ns"

# Fast segmentation in segmentation-only mode
thu_seg = thulac.thulac(seg_only=True)
result_seg = thu_seg.fast_cut("我爱北京天安门", text=True)
print(result_seg)
# Output: "我 爱 北京 天安门"
```
```

--------------------------------

### User-Defined Dictionary

Source: https://context7.com/thunlp/thulac-python/llms.txt

Explains how to use a custom dictionary to improve segmentation accuracy for specific terms and new words.

```APIDOC
## User-Defined Dictionary

### Description
By providing a user-defined dictionary, you can enhance the segmentation accuracy for specialized terms, new words, or domain-specific vocabulary. The dictionary file should be UTF-8 encoded, with each word listed on a new line.

### Usage
Initialize the `thulac` object with the `user_dict` parameter pointing to your dictionary file.

### Request Example
```python
import thulac

# Create a user dictionary file named 'user_dict.txt':
# 深度学习
# 自然语言处理
# 机器翻译
# ChatGPT

# Initialize THULAC with the user dictionary
thu = thulac.thulac(user_dict="user_dict.txt")

# Example segmentation using the custom dictionary
result_with_dict = thu.cut("深度学习是人工智能的核心技术", text=True)
print(result_with_dict)
# Expected output might include '深度学习' as a single token.

# For comparison, segmentation without the custom dictionary:
thu_no_dict = thulac.thulac()
result_no_dict = thu_no_dict.cut("深度学习是人工智能的核心技术", text=True)
print(result_no_dict)
# Expected output might segment '深度' and '学习' separately.
```
```

--------------------------------

### Part-of-Speech Tagging Explanation

Source: https://context7.com/thunlp/thulac-python/llms.txt

Explains the part-of-speech tagging system used by THULAC, providing a list of tags and their meanings.

```APIDOC
## Part-of-Speech Tagging Explanation

### Description
THULAC uses a comprehensive part-of-speech tagging system that covers most word classes found in Chinese text. Below is an explanation of the tags and an example of their application.

### Tag Set
- **n**: Noun
- **np**: Proper Noun (Person Name)
- **ns**: Proper Noun (Place Name)
- **ni**: Proper Noun (Organization Name)
- **nz**: Proper Noun (Other Specific Name)
- **m**: Numeral
- **q**: Measure Word
- **mq**: Quantity Measure Word
- **t**: Time Word
- **f**: Directional Word
- **s**: Locative Word
- **v**: Verb
- **a**: Adjective
- **d**: Adverb
- **h**: Preceding Component
- **k**: Following Component
- **i**: Idiom
- **j**: Abbreviation
- **r**: Pronoun
- **c**: Conjunction
- **p**: Preposition
- **u**: Auxiliary Particle
- **y**: Sentence-final Particle
- **e**: Interjection
- **o**: Onomatopoeia
- **g**: Morpheme
- **w**: Punctuation
- **x**: Other
- **uw**: User-defined word (tagged when using `user_dict`)

### Example
```python
import thulac

thu = thulac.thulac()

text = "孙茂松教授在清华大学讲授自然语言处理课程"
result = thu.cut(text)
print(result)

# Output Example:
# [['孙茂松', 'np'],    # np: Person Name
#  ['教授', 'n'],       # n: Noun
#  ['在', 'p'],         # p: Preposition
#  ['清华大学', 'ni'],  # ni: Organization Name
#  ['讲授', 'v'],       # v: Verb
#  ['自然', 'n'],       # n: Noun
#  ['语言', 'n'],       # n: Noun
#  ['处理', 'v'],       # v: Verb
#  ['课程', 'n']]       # n: Noun
```
```

--------------------------------

### File Segmentation (cut_f())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Processes an entire file, reading text from an input file and writing segmented results to an output file.
```APIDOC
## cut_f() - Segment an Entire File

### Description
Processes a whole file by reading content from an input file and writing the segmentation results to an output file. This method is suitable for handling large amounts of text data and supports UTF-8 encoded text files.

### Method
`thulac_instance.cut_f(input_filepath, output_filepath)`

### Parameters
- **input_filepath** (string) - The path to the input text file.
- **output_filepath** (string) - The path to the output file where segmented results will be written.

### Request Example
```python
import thulac

# Initialize THULAC
thu = thulac.thulac()

# Segment a file (segmentation + POS tagging)
thu.cut_f("input.txt", "output.txt")
# Input file input.txt content: "我爱北京天安门"
# Output file output.txt content: "我_r 爱_v 北京_ns 天安门_ns"

# Segmentation only mode for files
thu_seg = thulac.thulac(seg_only=True)
thu_seg.cut_f("input.txt", "output_seg.txt")
# Output file output_seg.txt content: "我 爱 北京 天安门"

# A success message will be printed upon completion
# "successfully cut file input.txt!"
```
```

--------------------------------

### POST /cut

Source: https://context7.com/thunlp/thulac-python/llms.txt

Performs lexical analysis on a given input string, returning segmented text with part-of-speech tags.

```APIDOC
## POST /cut

### Description
Segments a Chinese sentence and performs part-of-speech tagging. Supports custom dictionaries and text-only output formats.

### Method
POST

### Endpoint
/cut

### Parameters
#### Request Body
- **text** (string) - Required - The Chinese text to be analyzed.
- **text_only** (boolean) - Optional - If true, returns only the segmented words without POS tags.

### Request Example
{
  "text": "深度学习是人工智能的核心技术",
  "text_only": false
}

### Response
#### Success Response (200)
- **result** (string) - The segmented and tagged string.
#### Response Example
{
  "result": "深度_n 学习_v 是_v 人工_b 智能_n 的_u 核心_n 技术_n"
}
```

--------------------------------

### Perform Chinese Word Segmentation with Custom Dictionary

Source: https://context7.com/thunlp/thulac-python/llms.txt

This snippet demonstrates how to use the THULAC Python API to segment a string while applying a user-defined dictionary. Words found in the user dictionary are tagged with the 'uw' label, allowing for domain-specific entity recognition.

```python
import thulac

# Initialize THULAC
thu = thulac.thulac(user_dict='user_dict.txt')

# Perform segmentation with custom dictionary
result2 = thu.cut("深度学习是人工智能的核心技术", text=True)
print(result2)
# Expected output: "深度学习_uw 是_v 人工智能_n 的_u 核心_n 技术_n"
```

--------------------------------

### Single Sentence Segmentation (cut())

Source: https://context7.com/thunlp/thulac-python/llms.txt

Segments a single sentence into words and optionally tags them with part-of-speech.

```APIDOC
## cut() - Segment a Single Sentence

### Description
Performs Chinese word segmentation on a given sentence. It can return the result as a list of `[word, tag]` pairs (default) or as a space-separated string if `text=True` is specified. This method supports both segmentation and part-of-speech tagging, or segmentation only if the analyzer was initialized with `seg_only=True`.

### Method
`thulac_instance.cut(sentence, text=False)`

### Parameters
- **sentence** (string) - The input sentence to segment.
- **text** (boolean) - Optional - If True, returns the result as a space-separated string. Defaults to False (returns `[[word, tag], ...]`).
### Request Example
```python
import thulac

# Initialize THULAC (segmentation + POS tagging mode)
thu = thulac.thulac()

# Return as a list of [word, tag] pairs (default)
result = thu.cut("我爱北京天安门")
print(result)
# Output: [['我', 'r'], ['爱', 'v'], ['北京', 'ns'], ['天安门', 'ns']]

# Return as a space-separated string
result_text = thu.cut("我爱北京天安门", text=True)
print(result_text)
# Output: "我_r 爱_v 北京_ns 天安门_ns"

# Segmentation only mode
thu_seg = thulac.thulac(seg_only=True)
result_seg = thu_seg.cut("我爱北京天安门")
print(result_seg)
# Output: [['我', ''], ['爱', ''], ['北京', ''], ['天安门', '']]

result_seg_text = thu_seg.cut("我爱北京天安门", text=True)
print(result_seg_text)
# Output: "我 爱 北京 天安门"

# Processing multi-line text
multi_line = "今天天气真好\n我们去公园玩吧"
result_multi = thu.cut(multi_line, text=True)
print(result_multi)
# Output:
# 今天_t 天气_n 真_d 好_a
# 我们_r 去_v 公园_n 玩_v 吧_y
```
```
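--------------------------------

### Post-Processing cut() Output (Python)

A minimal sketch, not part of THULAC itself. It assumes the tagged-string format shown above for `cut(..., text=True)` (tokens separated by whitespace, each written as word, delimiter, tag) and converts it back into `[word, tag]` pairs, then groups the words by tag. The helper names `parse_tagged_text` and `group_by_tag` are hypothetical, and the sample string is copied from the example output above rather than produced by running the library, so the code runs without THULAC installed.

```python
from collections import defaultdict


def parse_tagged_text(tagged, deli="_"):
    """Split a 'word_tag word_tag ...' string into [word, tag] pairs.

    Hypothetical helper: assumes whitespace-separated tokens in the
    word<deli>tag form that the examples above show for text=True.
    """
    pairs = []
    for token in tagged.split():
        # rpartition splits on the LAST delimiter, so a word that itself
        # contains the delimiter is kept whole; a token with no delimiter
        # (segmentation-only output) gets an empty tag.
        word, sep, tag = token.rpartition(deli)
        pairs.append([word, tag] if sep else [token, ""])
    return pairs


def group_by_tag(pairs):
    """Collect words under their POS tag, keeping the original word order."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


# Sample string copied from the documented example output, not a live run.
sample = "我_r 爱_v 北京_ns 天安门_ns"
pairs = parse_tagged_text(sample)
groups = group_by_tag(pairs)
print(pairs)
print(groups)
```

If the analyzer was created with a custom delimiter such as `thulac.thulac(deli='/')`, passing `deli='/'` to `parse_tagged_text` should keep the parsing consistent with that setting.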