### Installing SentencePiece and Preparing Data (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet installs the `sentencepiece` Python library using pip and downloads a sample text file (`botchan.txt`) from GitHub. The text file is used as training data for the SentencePiece model. These are prerequisite steps for the subsequent examples.

```python
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
```

--------------------------------

### Installing Protobuf for SentencePiece Byte Offset Extraction (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows the command to install the `protobuf` Python module, which is a prerequisite for extracting byte offsets and other metadata from SentencePiece models.

```Python
!pip install protobuf
```

--------------------------------

### Installing SentencePiece Python Module via Pip

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command provides the standard method for installing the SentencePiece Python wrapper. Executing `pip install sentencepiece` downloads and installs the package, enabling access to SentencePiece's training and segmentation functionality within Python environments.

```Shell
pip install sentencepiece
```

--------------------------------

### Installing SentencePiece Python Module via pip

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

This command installs the SentencePiece Python module directly using pip, suitable for Linux, macOS, and Windows environments. It is the simplest way to get started with SentencePiece.
```Shell
pip install sentencepiece
```

--------------------------------

### Installing SentencePiece using vcpkg

Source: https://github.com/google/sentencepiece/blob/master/README.md

This sequence of commands demonstrates how to use the vcpkg dependency manager to download, build, and install SentencePiece. It covers cloning vcpkg, bootstrapping it, integrating it with your development environment, and finally installing the SentencePiece port.

```Shell
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece
```

--------------------------------

### Building and Installing SentencePiece Python Wrapper from Source (Windows)

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

These commands build and install the SentencePiece Python wrapper from source on Windows using Visual Studio Developer PowerShell. They clone the repository, configure with CMake, build and install the native library, and then build and install the Python wheel package.

```PowerShell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake .. -DSPM_ENABLE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=".\\root"
cmake --build . --config Release --target install
cd ../python
pip install wheel
python setup.py bdist_wheel
Get-ChildItem .\dist\sentencepiece*.whl | ForEach-Object { pip install $_.FullName }
```

--------------------------------

### Installing SentencePiece Python Wheel

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command installs a SentencePiece Python wheel file (`.whl`), downloaded from the GitHub releases page, using pip, the Python package installer.
```Shell
pip install wheel_file.whl
```

--------------------------------

### Building and Installing SentencePiece Python Wrapper from Source (Linux/macOS)

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

These commands build and install the SentencePiece Python wrapper from source on Linux and macOS. They involve cloning the repository, configuring with CMake, compiling, installing, and then building and installing the Python wheel package.

```Shell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake .. -DSPM_ENABLE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=./root
make install
cd ../python
python setup.py bdist_wheel
pip install dist/sentencepiece*.whl
```

--------------------------------

### Installing Build Dependencies on Ubuntu

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command installs the necessary build tools and libraries for SentencePiece on Ubuntu, including CMake, essential build utilities, pkg-config, and the optional gperftools library for performance improvements.

```Shell
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

--------------------------------

### Building and Installing SentencePiece from C++ Source

Source: https://github.com/google/sentencepiece/blob/master/README.md

These commands outline the process to clone the SentencePiece repository, create a build directory, configure the build with CMake, compile the source code using Make, and install the command-line tools system-wide. They also include the post-installation step of refreshing the shared-library cache.

```Shell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
```

--------------------------------

### Installing SentencePiece Python Wrapper to User Directory

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

This command installs the SentencePiece Python wrapper into the user's local site-packages directory, avoiding the need for global write permissions. It is useful when installing without root privileges.

```Shell
python setup.py install --user
```

--------------------------------

### Setting Default Installation Directory Variables

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet ensures that standard CMake installation directory variables (`CMAKE_INSTALL_BINDIR`, `CMAKE_INSTALL_LIBDIR`, `CMAKE_INSTALL_INCLUDEDIR`) are defined. If they are not already set, they are assigned the default values `bin`, `lib`, and `include` respectively, providing consistent installation paths.

```CMake
if (NOT DEFINED CMAKE_INSTALL_BINDIR)
  set(CMAKE_INSTALL_BINDIR bin)
endif()
if (NOT DEFINED CMAKE_INSTALL_LIBDIR)
  set(CMAKE_INSTALL_LIBDIR lib)
endif()
if (NOT DEFINED CMAKE_INSTALL_INCLUDEDIR)
  set(CMAKE_INSTALL_INCLUDEDIR include)
endif()
```

--------------------------------

### Installing SentencePiece Header Files (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This snippet installs the public header files for SentencePiece (`sentencepiece_trainer.h`, `sentencepiece_processor.h`) to the standard include directory. It also conditionally installs Protobuf-related headers if an external Protobuf provider is used.
```CMake
install(FILES sentencepiece_trainer.h sentencepiece_processor.h
        DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
if (NOT SPM_PROTOBUF_PROVIDER STREQUAL "internal")
  install(FILES ${SPM_PROTO_HDRS} DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
endif()
```

--------------------------------

### Training SentencePiece Model from Word Frequency List (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to prepare a TSV file containing words and their frequencies, and then train a SentencePiece model from this file with the `input_format=tsv` option. It concludes by demonstrating text encoding with the trained model.

```Python
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))

spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
```

--------------------------------

### Retrieving N-Best Segmentations in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to retrieve the top N-best segmentations for a given input text using SentencePiece's `nbest_encode_as_pieces` and `nbest_encode_as_ids` methods. It demonstrates getting the 10 best segmentations, providing alternative tokenizations for applications like subword regularization.
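For intuition, a unigram model scores each candidate segmentation by the sum of its pieces' log-probabilities, and n-best retrieval returns the highest-scoring splits. The toy sketch below illustrates that idea with a hand-picked vocabulary and exhaustive search; `nbest_segmentations` and the `vocab` values are illustrative inventions, not SentencePiece's lattice-based algorithm.

```python
import heapq

def nbest_segmentations(text, logprob, n):
    """Toy n-best unigram segmentation via exhaustive search.

    logprob: dict mapping piece -> log probability.
    Returns up to n (score, pieces) tuples, best first.
    """
    results = []

    def dfs(pos, pieces, score):
        if pos == len(text):
            results.append((score, list(pieces)))
            return
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in logprob:
                pieces.append(piece)
                dfs(end, pieces, score + logprob[piece])
                pieces.pop()

    dfs(0, [], 0.0)
    return heapq.nlargest(n, results)

# Toy vocabulary with made-up log probabilities.
vocab = {'h': -4.0, 'e': -4.0, 'l': -4.0, 'o': -4.0,
         'he': -3.0, 'll': -3.0, 'llo': -3.5, 'hello': -2.0}

for score, pieces in nbest_segmentations('hello', vocab, 3):
    print(score, pieces)
```

Real SentencePiece models do the same ranking over a much larger learned vocabulary, using dynamic programming over a lattice rather than exhaustive enumeration.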
```python
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
```

--------------------------------

### Using join_paths for Installation Directory Configuration (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Calls the `join_paths` function to construct paths for `libdir_for_pc_file` and `includedir_for_pc_file`. These paths are typically used in `.pc` (pkg-config) files to specify installation directories relative to `exec_prefix` and `prefix`.

```CMake
join_paths(libdir_for_pc_file "\${exec_prefix}" "${CMAKE_INSTALL_LIBDIR}")
join_paths(includedir_for_pc_file "\${prefix}" "${CMAKE_INSTALL_INCLUDEDIR}")
```

--------------------------------

### Configuring Conditional Installation for Targets (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This section defines the installation rules for the `SPM_INSTALLTARGETS` list based on the operating system. It specifies different installation destinations and bundle options for iOS versus other systems, ensuring proper deployment.

```CMake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS")
  install(TARGETS ${SPM_INSTALLTARGETS}
          BUNDLE DESTINATION ${CMAKE_INSTALL_BINDIR}
          RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
          LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
          ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
else()
  install(TARGETS ${SPM_INSTALLTARGETS}
          RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
          LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
          ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif()
```

--------------------------------

### Configuring Platform-Specific Installation Paths

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This block configures installation directory variables (`prefix`, `exec_prefix`, `libdir`, `includedir`) based on the operating system. For Unix systems, it includes GNUInstallDirs for standard paths; otherwise, it sets generic paths. It also sets `GNUCXX_STD_SUPPORT_VERSION`.
```CMake
if (UNIX)
  include(GNUInstallDirs)
  set(prefix ${CMAKE_INSTALL_PREFIX})
  set(exec_prefix "\${prefix}")
  set(libdir "\${exec_prefix}/${CMAKE_INSTALL_LIBDIR}")
  set(includedir "\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}")
else()
  set(prefix ${CMAKE_INSTALL_PREFIX})
  set(exec_prefix "\${prefix}")
  set(libdir "\${exec_prefix}/lib")
  set(includedir "\${prefix}/include")
endif()
set(GNUCXX_STD_SUPPORT_VERSION "4.3")
```

--------------------------------

### Appending Core Executables to Install Targets List (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This command appends the core SentencePiece executables to the `SPM_INSTALLTARGETS` list. This list is later used by the install command to specify which targets should be installed.

```CMake
list(APPEND SPM_INSTALLTARGETS spm_encode spm_decode spm_normalize spm_train spm_export_vocab)
```

--------------------------------

### Training and Using SentencePiece Character Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a character-level segmentation model using SentencePiece and then apply it to segment text. It shows both `encode_as_pieces` to get character pieces and `encode_as_ids` to get their corresponding vocabulary IDs. The model is trained on `botchan.txt` with a vocabulary size of 2000, treating each character as a unit.
```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
```

--------------------------------

### Extracting Byte Offsets and N-Best Results from SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example demonstrates how to train a SentencePiece model and then use the `encode_as_serialized_proto` and `nbest_encode_as_serialized_proto` methods to retrieve token byte offsets and N-best segmentation results, parsed with the `sentencepiece_pb2` module.

```Python
from sentencepiece import sentencepiece_pb2

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# One best result
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto('ｈｅｌｌｏ'))  # Full-width hello
# begin/end (offsets) point into the original input.
print(spt)

# Nbest results
nspt = sentencepiece_pb2.NBestSentencePieceText()
nspt.ParseFromString(sp.nbest_encode_as_serialized_proto('ｈｅｌｌｏ', 5))
# print(nspt)
```

--------------------------------

### Training and Using a Basic SentencePiece Model (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates the core end-to-end workflow of SentencePiece. It trains a new model (`m.model`) from `botchan.txt` with a vocabulary size of 2000, then loads this model into a `SentencePieceProcessor` instance. Finally, it shows how to encode text into subword pieces and their corresponding IDs, and how to decode pieces or IDs back into text.
```python
import sentencepiece as spm

# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
```

--------------------------------

### Conditional Installation of Pkg-Config File (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This block conditionally installs the generated `sentencepiece.pc` file into the `pkgconfig` directory under `CMAKE_INSTALL_LIBDIR`, but only if the compiler is not MSVC. This ensures proper pkg-config integration on non-Windows systems.

```CMake
if (NOT MSVC)
  # suppress warning for C++11 features.
  # add_definitions("-Wno-deprecated-declarations -Wno-deprecated-enum-enum-conversion")
  install(FILES "${CMAKE_CURRENT_BINARY_DIR}/sentencepiece.pc"
          DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)
endif()
```

--------------------------------

### Querying SentencePiece Model Vocabulary (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates various methods for querying information from a loaded SentencePiece model. It shows how to retrieve the vocabulary size, convert between piece IDs and their string representations, handle unknown tokens, and identify the default control symbols `<unk>`, `<s>`, and `</s>`.
```python
# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))

# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# and they are defined as 'control' symbols.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
```

--------------------------------

### Training SentencePiece Model in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to train a new SentencePiece model using the `SentencePieceTrainer::Train` function. Training parameters are passed as a single string, in the same format as the command-line utility `spm_train`.

```C++
#include <sentencepiece_trainer.h>

sentencepiece::SentencePieceTrainer::Train("--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
```

--------------------------------

### Training SentencePiece Model with Input Sentence Size Limit (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a SentencePiece model on a subset of the training data by specifying `input_sentence_size`. It also shows how to load the trained model and encode text into pieces.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
```

--------------------------------

### Training and Using SentencePiece Unigram Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates the process of training a Unigram model with SentencePiece, loading the resulting model, and performing subword segmentation.
It highlights the use of `encode_as_pieces` for standard segmentation and `nbest_encode_as_pieces` to retrieve multiple segmentation options, a feature supported by Unigram models. The model is trained on `botchan.txt` with a vocabulary size of 2000.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
```

--------------------------------

### Configuring RPATH for macOS Builds

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet configures Run-Time Search Path (RPATH) settings specifically for macOS builds. It enables RPATH, controls its behavior during build and install, and sets the install RPATH to the project's library directory, ensuring that shared libraries can be found at runtime after installation.

```CMake
if (APPLE)
  set(CMAKE_MACOSX_RPATH ON)
  set(CMAKE_SKIP_BUILD_RPATH FALSE)
  set(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
  set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib")
  set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
  list(FIND CMAKE_PLATFORM_IMPLICIT_LINK_DIRECTORIES "${CMAKE_INSTALL_PREFIX}/lib" isSystemDir)
  if ("${isSystemDir}" STREQUAL "-1")
    set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib")
  endif()
endif()
```

--------------------------------

### Training and Using SentencePiece Word Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates training a word-level segmentation model with SentencePiece, where input text is expected to be pre-tokenized and segmentation occurs primarily on whitespace. It demonstrates `encode_as_pieces` for word segmentation and `encode_as_ids` for retrieving the corresponding vocabulary IDs.
The model is trained on `botchan.txt` with a vocabulary size of 2000.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.'))  # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
```

--------------------------------

### Applying Subword Regularization with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/README.md

This Python example illustrates subword regularization with the SentencePiece library. It initializes a `SentencePieceProcessor` and repeatedly calls `encode` with `enable_sampling=True`, `alpha=0.1`, and `nbest_size=-1`. The same input 'New York' can yield different segmentations on each call thanks to on-the-fly subword sampling, which enhances model robustness.

```Python
import sentencepiece as spm

s = spm.SentencePieceProcessor(model_file='spm.model')
for n in range(5):
    print(s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```

--------------------------------

### Restricting SentencePiece Vocabulary and Resetting (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example shows how to train a SentencePiece model, then restrict its vocabulary to only the tokens appearing more than a specified number of times in the training data. It also demonstrates how to reset the vocabulary restriction.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# Aggregates the frequency of each token in the training data.
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in sp.encode_as_pieces(line):
            freq.setdefault(piece, 0)
            freq[piece] += 1

# only uses the tokens appearing more than 1000 times in the training data.
vocabs = list(filter(lambda x: x in freq and freq[x] > 1000, vocabs))
sp.set_vocabulary(vocabs)
print(sp.encode_as_pieces('this is a test.'))

# reset the restriction
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))
```

--------------------------------

### End-to-End SentencePiece Workflow (Shell)

Source: https://github.com/google/sentencepiece/blob/master/README.md

This snippet demonstrates the complete SentencePiece workflow. It starts by training a Unigram model using `spm_train`, then encodes a sample sentence into subword pieces and their corresponding IDs using `spm_encode`, and finally decodes the IDs back to the original sentence using `spm_decode`. This illustrates the round-trip capability of SentencePiece.

```Shell
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with : input: "../data/botchan.txt"
...
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```

--------------------------------

### Tokenizing Text to String Pieces in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example illustrates how to use the `SentencePieceProcessor::Encode` method to tokenize an input string into a vector of `std::string` pieces. It then iterates through and prints each generated token.

```C++
std::vector<std::string> pieces;
processor.Encode("This is a test.", &pieces);
for (const std::string &token : pieces) {
  std::cout << token << std::endl;
}
```

--------------------------------

### Training with Control Symbols (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows how to train a SentencePiece model with "control symbols" such as `<sep>` and `<cls>`. Unlike user-defined symbols, control symbols only reserve IDs and are not treated as single tokens if they appear in the input text. They are typically inserted explicitly by the user after encoding in production settings, to prevent unintended behavior from user input. The example demonstrates training and loading such a model.

```python
# Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
```

--------------------------------

### Defining BOS/EOS as User-Defined Symbols in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to treat the beginning-of-sentence (`<s>`) and end-of-sentence (`</s>`) tokens as user-defined symbols rather than default control symbols in SentencePiece. It trains two models: one with the default behavior where `<s>`/`</s>` in the input are segmented, and another where they are explicitly defined as user symbols, causing them to be treated as single tokens.
```python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>'))  # <s>, </s> are segmented. (default behavior)

sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>'))  # <s>, </s> are handled as one token.
```

--------------------------------

### Customizing Special Symbol IDs and Surface Forms in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to customize the vocabulary IDs and surface representations of the special symbols (PAD, UNK, BOS, EOS) during SentencePiece model training using dedicated flags. It then loads the trained model and iterates through the first few IDs to print their corresponding pieces and check whether they are recognized as control symbols.

```python
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
```

--------------------------------

### Training SentencePiece Model to Allow Crossing-Word Pieces (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates training a SentencePiece model with `split_by_whitespace=false` to allow pieces to cross word boundaries. It then shows how to load the model and identify such crossing-word pieces using regular expressions.

```Python
import re

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

for piece in vocabs[0:500]:
    if re.match(r'\w+▁\w+', piece):
        print(piece)
```

--------------------------------

### Training with User-Defined Symbols (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a SentencePiece model with custom "user-defined symbols" like `<sep>` and `<cls>`. These symbols are treated as single tokens during encoding and can appear directly in the input text, making them useful for experimental purposes or specific NLP tasks like BERT-style tokenization. The example shows how to train, load, encode text containing these symbols, and query their IDs.

```python
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>
```

--------------------------------

### Configuring Abseil Library Integration (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This conditional block manages how the Abseil library is integrated into the project based on the `SPM_ABSL_PROVIDER` variable. It supports three modes: "internal" (uses a local copy), "module" (fetches via `FetchContent` and adds as a subdirectory), and "package" (finds an installed package). It also handles symlinking for consistent `absl` directory access.
```CMake
if (SPM_ABSL_PROVIDER STREQUAL "internal")
  include_directories(${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
elseif (SPM_ABSL_PROVIDER STREQUAL "module")
  include(FetchContent)
  FetchContent_Populate(abseil-cpp
    GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/third_party/abseil-cpp
    GIT_PROGRESS TRUE)
  add_subdirectory(third_party/abseil-cpp)
  if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    file(RENAME ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl
         ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/abseil-cpp/absl
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
  endif()
elseif (SPM_ABSL_PROVIDER STREQUAL "package")
  find_package(absl REQUIRED)
  get_target_property(ABSL_INCLUDE_DIRS absl::base INTERFACE_INCLUDE_DIRECTORIES)
  if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    file(RENAME ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl
         ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink
                    ${ABSL_INCLUDE_DIRS}/absl
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
  endif()
  include_directories(${ABSL_INCLUDE_DIRS})
endif()
```

--------------------------------

### Training and Using SentencePiece BPE Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a Byte Pair Encoding (BPE) model using SentencePiece, load the trained model, and then use it for subword segmentation. It shows `encode_as_pieces` for standard segmentation and `nbest_encode_as_pieces`, which returns an empty list for BPE models because they do not support n-best segmentation. The model is trained on `botchan.txt` with a vocabulary size of 2000.
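For intuition about what `--model_type=bpe` trains, BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols in the corpus. The sketch below is a simplified, toy illustration of that merge loop (the `bpe_merges` helper and the word list are inventions for this example, not SentencePiece's implementation, which also handles the `▁` whitespace marker and piece scores):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]  # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_merges(['low', 'lower', 'lowest'] * 3, 3))
```

Each learned merge becomes a vocabulary piece; encoding later replays the merges on new text in the same order.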
```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5))  # returns an empty list.
```

--------------------------------

### Configuring CPack for Package Generation (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Configures CPack, CMake's packaging tool, to define how installation packages are generated. It sets the source and binary package generators, package version, license and readme files, contact information, and files to ignore during source packaging. Finally, `include(CPack)` enables the CPack module.

```CMake
set(CPACK_SOURCE_GENERATOR "TXZ")
set(CPACK_GENERATOR "7Z")
set(CPACK_PACKAGE_VERSION "${SPM_VERSION}")
set(CPACK_STRIP_FILES TRUE)
set(CPACK_RESOURCE_FILE_LICENSE "${PROJECT_SOURCE_DIR}/LICENSE")
set(CPACK_RESOURCE_FILE_README "${PROJECT_SOURCE_DIR}/README.md")
set(CPACK_PACKAGE_CONTACT "taku@google.com")
set(CPACK_DEBIAN_PACKAGE_MAINTAINER "Taku Kudo")
set(CPACK_SOURCE_IGNORE_FILES "/build/;/.git/;/dist/;/sdist/;~$;${CPACK_SOURCE_IGNORE_FILES}")
include(CPack)
```

--------------------------------

### Using ImmutableSentencePieceText for Decoding in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example illustrates how to use `ImmutableSentencePieceText` for decoding a sequence of IDs. It shows how to populate the proto with decoded data and then access the reconstructed text along with detailed information for each piece, including byte offsets and IDs.

```C++
processor.Decode({10, 20, 30}, spt.mutable_proto());
std::cout << spt.text() << std::endl;  // This is the same as the decoded string.

for (const auto &piece : spt.pieces()) {
  // the same as above.
}
```

--------------------------------

### Encoding Text with BOS/EOS Markers and Reversal

Source: https://github.com/google/sentencepiece/blob/master/README.md

These examples show how to use the `--extra_options` flag with `spm_encode` to add beginning-of-sentence (BOS) and/or end-of-sentence (EOS) markers to the encoded output, or to reverse the input sequence before encoding.

```Shell
spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```

--------------------------------

### Loading SentencePiece Model from Byte Stream (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates how to load a SentencePiece model directly from a serialized protocol buffer byte stream, rather than from a file path. It uses TensorFlow's `tf.io.gfile.GFile` to read the model file into memory, which is useful when models are stored in non-Posix file systems or need to be loaded dynamically. After loading, it demonstrates encoding text.

```python
import tensorflow as tf
import sentencepiece as spm

# Assumes that m.model is stored in a non-Posix file system.
serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()

sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))
```

--------------------------------

### Performing Sampled Segmentation for Subword Regularization in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to use sampled segmentation with a SentencePiece Unigram model for subword regularization, a technique for data augmentation.
It shows how to obtain different segmentations for the same input text by repeatedly calling `sample_encode_as_pieces` and `sample_encode_as_ids` with the `nbest_size` and `inverse_temperature` parameters.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature).
# See the paper [kudo18] for details.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))

for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
```

--------------------------------

### Setting Extra Decoding Options in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This snippet demonstrates how to use `SetDecodeExtraOptions` to apply modifications during the decoding process. It shows an example of reversing the decoder's output. This method should be invoked immediately after the model has been loaded.

```C++
processor.SetDecodeExtraOptions("reverse");  // the decoder's output is reversed.
```

--------------------------------

### Training SentencePiece with Pre-defined Normalization Rule in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows how to train a SentencePiece model using a pre-defined normalization rule, specifically `nfkc_cf` for NFKC normalization and Unicode case folding (lower casing). After training, the model is loaded and used to demonstrate how input text is automatically normalized during segmentation, converting full-width and uppercase characters to their normalized, lowercased forms.

```Python
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.'))  # lower casing and normalization
```

--------------------------------

### Setting Extra Encoding Options in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to use `SetEncodeExtraOptions` to modify the encoding behavior. It demonstrates adding beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens, and also reversing the input before adding these tokens. This method must be called after the model is loaded.

```C++
processor.SetEncodeExtraOptions("bos:eos");          // add <s> and </s>.
processor.SetEncodeExtraOptions("reverse:bos:eos");  // reverse the input and then add <s> and </s>.
```

--------------------------------

### Customizing UNK Symbol Surface Representation in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example shows how to change the surface representation of the Unknown (UNK) symbol when decoding IDs in SentencePiece. It first demonstrates the default UNK surface (U+2047) and then trains a new model with the `--unk_surface` flag to set a custom string, `__UNKNOWN__`, for the UNK symbol's decoded form.
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))  # default is U+2047

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --unk_surface=__UNKNOWN__')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))
```

--------------------------------

### Getting Vocabulary Size using len() with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Demonstrates using the built-in `len()` function on a SentencePieceProcessor instance to retrieve the vocabulary size, providing a convenient alternative to `get_piece_size()`.

```Python
len(sp)
```

--------------------------------

### Encoding and Decoding Control Symbols with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how SentencePiece handles pre-defined control symbols like `<sep>` and `<cls>`. It shows encoding text containing these symbols into pieces, retrieving their corresponding IDs, and decoding these IDs back to their surface forms, which results in an empty string for control symbols.
```python
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))

print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4

print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty
```

--------------------------------

### Accessing and Using Special Symbol IDs in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to retrieve the default IDs for Beginning-of-Sentence (BOS), End-of-Sentence (EOS), Unknown (UNK), and Padding (PAD) symbols using the `bos_id()`, `eos_id()`, `unk_id()`, and `pad_id()` methods in SentencePiece. It also shows how to manually prepend and append BOS/EOS IDs to an encoded sequence of text.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default

print(sp.encode_as_ids('Hello world'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
```

--------------------------------

### Getting Vocabulary Size with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Retrieves the total number of unique pieces (vocabulary size) in the loaded SentencePiece model using `sp.get_piece_size()`. This indicates the size of the model's vocabulary.

```Python
sp.get_piece_size()
```

--------------------------------

### Detokenizing Text from Pieces or IDs in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to use `SentencePieceProcessor::Decode` to reconstruct text from either a sequence of string pieces or a sequence of integer IDs.
It highlights that detokenization is designed to be the inverse operation of encoding, ensuring consistency.

```C++
// Sequence of pieces.
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." };
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;

// Sequence of ids.
std::vector<int> ids = { 451, 26, 20, 3, 158, 128, 12 };
processor.Decode(ids, &text);
std::cout << text << std::endl;
```

--------------------------------

### Training SentencePiece with Custom Normalization Rules from TSV in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to define and apply custom text normalization rules in SentencePiece using a TSV file. It includes a helper function `tocode` to convert strings to Unicode code points for the TSV format, creates a `normalization_rule.tsv` file with "I'm" to "I am" and "don't" to "do not" mappings, and then trains a SentencePiece model using this custom rule. Finally, it shows how the loaded model automatically applies these rules during segmentation.

```Python
import sentencepiece as spm

def tocode(s):
    out = []
    for c in s:
        out.append(str(hex(ord(c))).replace('0x', 'U+'))
    return ' '.join(out)

# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not", "I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')

sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy"))          # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it."))  # normalized to 'I do not know it.'
```

--------------------------------

### Initializing SentencePieceProcessor in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Demonstrates how to import the SentencePiece library and initialize a SentencePieceProcessor instance by loading a pre-trained model file. This is the first step before performing any segmentation or tokenization operations.

```Python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
```

--------------------------------

### Disabling BOS/EOS Symbols in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to disable the Beginning-of-Sentence (BOS) and End-of-Sentence (EOS) symbols in a SentencePiece model by setting their IDs to -1 during training. When disabled, these symbols are no longer treated as special tokens and are instead mapped to the Unknown (UNK) symbol ID if encountered.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
```

--------------------------------

### Initializing CMake Project and Setting Version

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet sets the minimum required CMake version, reads the project version from VERSION.txt, displays it, and then defines the SentencePiece project with C and C++ language support. It also handles CMake policy CMP0091 for modern behavior.
```CMake
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)

file(STRINGS "VERSION.txt" SPM_VERSION)
message(STATUS "VERSION: ${SPM_VERSION}")

if(POLICY CMP0091)
  cmake_policy(SET CMP0091 NEW)
endif()

project(sentencepiece VERSION ${SPM_VERSION} LANGUAGES C CXX)
```

--------------------------------

### Loading SentencePiece Model in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This snippet demonstrates how to initialize the SentencePieceProcessor and load a pre-trained model. It shows loading from a file path and includes basic error handling. It also comments on the alternative method of loading from a serialized proto string.

```C++
#include <sentencepiece_processor.h>

sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("//path/to/model.model");
if (!status.ok()) {
  std::cerr << status.ToString() << std::endl;
  // error
}

// You can also load a serialized model from std::string.
// const std::string str = // Load blob contents from a file.
// auto status = processor.LoadFromSerializedProto(str);
```

--------------------------------

### Training a SentencePiece Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Illustrates how to train a new SentencePiece model using `SentencePieceTrainer.train()`. It takes an input text file, a model prefix for output files, and parameters like vocabulary size and user-defined symbols.

```Python
import sentencepiece as spm

spm.SentencePieceTrainer.train(input='test/botchan.txt', model_prefix='m', vocab_size=1000, user_defined_symbols=['foo', 'bar'])
```

--------------------------------

### Verifying SentencePiece Release Binary with SLSA

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command uses the `slsa-verifier` tool to verify the authenticity and integrity of a SentencePiece release binary. It requires the artifact path, the downloaded provenance file (`attestation.intoto.jsonl`), the source repository, and the release tag.
```Shell
slsa-verifier -artifact-path <the-artifact> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```

--------------------------------

### Configuring Project Files from Templates (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Uses `configure_file` to generate `config.h` from `config.h.in` and `sentencepiece.pc` from `sentencepiece.pc.in`. The `@ONLY` option for `sentencepiece.pc` ensures that only CMake variables are substituted, not shell variables.

```CMake
configure_file("${PROJECT_SOURCE_DIR}/config.h.in" "config.h")
configure_file("${PROJECT_SOURCE_DIR}/sentencepiece.pc.in" "sentencepiece.pc" @ONLY)
```

--------------------------------

### Training a SentencePiece Model

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command trains a new SentencePiece model from a raw corpus file. Key parameters include the input file, output model prefix, desired vocabulary size, character coverage, and the model type (unigram, BPE, char, or word).

```Shell
spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```