### Installing SentencePiece and Preparing Data (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet installs the `sentencepiece` Python library using pip and downloads a sample text file (`botchan.txt`) from GitHub. The text file is used as training data for the SentencePiece model. These are prerequisite steps for the subsequent examples.

```python
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
```

--------------------------------

### Installing Protobuf for SentencePiece Byte Offset Extraction (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows the command to install the `protobuf` Python module, which is a prerequisite for extracting byte offsets and other metadata from SentencePiece models.

```Python
!pip install protobuf
```

--------------------------------

### Installing SentencePiece Python Module via Pip

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command provides the standard method for installing the SentencePiece Python wrapper. Executing `pip install sentencepiece` downloads and installs the package, enabling access to SentencePiece's training and segmentation functionality within Python environments.

```Shell
pip install sentencepiece
```

--------------------------------

### Installing SentencePiece Python Module via pip

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

This command installs the SentencePiece Python module directly using pip, suitable for Linux, macOS, and Windows environments. It is the simplest way to get started with SentencePiece.
```Shell
pip install sentencepiece
```

--------------------------------

### Installing SentencePiece using vcpkg

Source: https://github.com/google/sentencepiece/blob/master/README.md

This sequence of commands demonstrates how to use the vcpkg dependency manager to download, build, and install SentencePiece. It covers cloning vcpkg, bootstrapping it, integrating it with your development environment, and finally installing the SentencePiece port.

```Shell
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece
```

--------------------------------

### Building and Installing SentencePiece Python Wrapper from Source (Windows)

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

These commands build and install the SentencePiece Python wrapper from source on Windows using Visual Studio Developer PowerShell. They clone the repository, configure with CMake, build and install the native library, and then build and install the Python wheel package.

```PowerShell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake .. -DSPM_ENABLE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=".\\root"
cmake --build . --config Release --target install
cd ../python
pip install wheel
python setup.py bdist_wheel
Get-ChildItem .\dist\sentencepiece*.whl | ForEach-Object { pip install $_.FullName }
```

--------------------------------

### Installing SentencePiece Python Wheel

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command installs a SentencePiece Python wheel file (`.whl`), downloaded from the GitHub releases page, using pip, the Python package installer.
```Shell
pip install wheel_file.whl
```

--------------------------------

### Building and Installing SentencePiece Python Wrapper from Source (Linux/macOS)

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

These commands build and install the SentencePiece Python wrapper from source on Linux and macOS. They involve cloning the repository, configuring with CMake, compiling, installing, and then building and installing the Python wheel package.

```Shell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake .. -DSPM_ENABLE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=./root
make install
cd ../python
python setup.py bdist_wheel
pip install dist/sentencepiece*.whl
```

--------------------------------

### Installing Build Dependencies on Ubuntu

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command installs the necessary build tools and libraries for SentencePiece on Ubuntu, including CMake, essential build utilities, pkg-config, and the optional gperftools library for performance improvements.

```Shell
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

--------------------------------

### Building and Installing SentencePiece from C++ Source

Source: https://github.com/google/sentencepiece/blob/master/README.md

These commands outline the process to clone the SentencePiece repository, create a build directory, configure the build with CMake, compile the source code using Make, and install the command-line tools system-wide. They also include the post-installation step of refreshing the shared-library cache.

```Shell
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
```

--------------------------------

### Installing SentencePiece Python Wrapper to User Directory

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

This command installs the SentencePiece Python wrapper into the user's local site-packages directory, avoiding the need for global write permissions. It is useful when installing without root privileges.

```Shell
python setup.py install --user
```

--------------------------------

### Setting Default Installation Directory Variables

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet ensures that standard CMake installation directory variables (`CMAKE_INSTALL_BINDIR`, `CMAKE_INSTALL_LIBDIR`, `CMAKE_INSTALL_INCLUDEDIR`) are defined. If they are not already set, they are assigned the default values `bin`, `lib`, and `include` respectively, providing consistent installation paths.

```CMake
if (NOT DEFINED CMAKE_INSTALL_BINDIR)
  set(CMAKE_INSTALL_BINDIR bin)
endif()
if (NOT DEFINED CMAKE_INSTALL_LIBDIR)
  set(CMAKE_INSTALL_LIBDIR lib)
endif()
if (NOT DEFINED CMAKE_INSTALL_INCLUDEDIR)
  set(CMAKE_INSTALL_INCLUDEDIR include)
endif()
```

--------------------------------

### Installing SentencePiece Header Files (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This snippet installs the public header files for SentencePiece (`sentencepiece_trainer.h`, `sentencepiece_processor.h`) to the standard include directory. It also conditionally installs Protobuf-related headers if an external Protobuf provider is used.
```CMake
install(FILES sentencepiece_trainer.h sentencepiece_processor.h
        DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
if (NOT SPM_PROTOBUF_PROVIDER STREQUAL "internal")
  install(FILES ${SPM_PROTO_HDRS} DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
endif()
```

--------------------------------

### Training SentencePiece Model from Word Frequency List (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to prepare a TSV file containing words and their frequencies, and then train a SentencePiece model from this file with the `input_format=tsv` option. It concludes by demonstrating text encoding with the trained model.

```Python
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))

spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
```

--------------------------------

### Retrieving N-Best Segmentations in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to retrieve the top N-best segmentations for a given input text using SentencePiece's `nbest_encode_as_pieces` and `nbest_encode_as_ids` methods. It demonstrates getting the 10 best segmentations, providing alternative tokenizations for applications like subword regularization.
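For intuition, a unigram model scores each candidate segmentation by the sum of its pieces' log-probabilities, and n-best retrieval returns the highest-scoring splits. The toy sketch below illustrates that idea with a hand-picked vocabulary and exhaustive search; `nbest_segmentations` and the `vocab` values are illustrative inventions, not SentencePiece's lattice-based algorithm.

```python
import heapq

def nbest_segmentations(text, logprob, n):
    """Toy n-best unigram segmentation via exhaustive search.

    logprob: dict mapping piece -> log probability.
    Returns up to n (score, pieces) tuples, best first.
    """
    results = []

    def dfs(pos, pieces, score):
        if pos == len(text):
            results.append((score, list(pieces)))
            return
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in logprob:
                pieces.append(piece)
                dfs(end, pieces, score + logprob[piece])
                pieces.pop()

    dfs(0, [], 0.0)
    return heapq.nlargest(n, results)

# Toy vocabulary with made-up log probabilities.
vocab = {'h': -4.0, 'e': -4.0, 'l': -4.0, 'o': -4.0,
         'he': -3.0, 'll': -3.0, 'llo': -3.5, 'hello': -2.0}

for score, pieces in nbest_segmentations('hello', vocab, 3):
    print(score, pieces)
```

Real SentencePiece models do the same ranking over a much larger learned vocabulary, using dynamic programming over a lattice rather than exhaustive enumeration.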
```python
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
```

--------------------------------

### Using join_paths for Installation Directory Configuration (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Calls the `join_paths` function to construct paths for `libdir_for_pc_file` and `includedir_for_pc_file`. These paths are typically used in `.pc` (pkg-config) files to specify installation directories relative to `exec_prefix` and `prefix`.

```CMake
join_paths(libdir_for_pc_file "\${exec_prefix}" "${CMAKE_INSTALL_LIBDIR}")
join_paths(includedir_for_pc_file "\${prefix}" "${CMAKE_INSTALL_INCLUDEDIR}")
```

--------------------------------

### Configuring Conditional Installation for Targets (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This section defines the installation rules for the `SPM_INSTALLTARGETS` list based on the operating system. It specifies different installation destinations and bundle options for iOS versus other systems, ensuring proper deployment.

```CMake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS")
  install(TARGETS ${SPM_INSTALLTARGETS}
          BUNDLE DESTINATION ${CMAKE_INSTALL_BINDIR}
          RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
          LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
          ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
else()
  install(TARGETS ${SPM_INSTALLTARGETS}
          RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
          LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
          ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif()
```

--------------------------------

### Configuring Platform-Specific Installation Paths

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This block configures installation directory variables (`prefix`, `exec_prefix`, `libdir`, `includedir`) based on the operating system. For Unix systems, it includes GNUInstallDirs for standard paths; otherwise, it sets generic paths. It also sets `GNUCXX_STD_SUPPORT_VERSION`.
```CMake
if (UNIX)
  include(GNUInstallDirs)
  set(prefix ${CMAKE_INSTALL_PREFIX})
  set(exec_prefix "\${prefix}")
  set(libdir "\${exec_prefix}/${CMAKE_INSTALL_LIBDIR}")
  set(includedir "\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}")
else()
  set(prefix ${CMAKE_INSTALL_PREFIX})
  set(exec_prefix "\${prefix}")
  set(libdir "\${exec_prefix}/lib")
  set(includedir "\${prefix}/include")
endif()
set(GNUCXX_STD_SUPPORT_VERSION "4.3")
```

--------------------------------

### Appending Core Executables to Install Targets List (CMake)

Source: https://github.com/google/sentencepiece/blob/master/src/CMakeLists.txt

This command appends the core SentencePiece executables to the `SPM_INSTALLTARGETS` list. This list is later used by the install command to specify which targets should be installed.

```CMake
list(APPEND SPM_INSTALLTARGETS spm_encode spm_decode spm_normalize spm_train spm_export_vocab)
```

--------------------------------

### Training and Using SentencePiece Character Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a character-level segmentation model using SentencePiece and then apply it to segment text. It shows both `encode_as_pieces` to get character pieces and `encode_as_ids` to get their corresponding vocabulary IDs. The model is trained on `botchan.txt` with a vocabulary size of 2000, treating each character as a unit.
```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
```

--------------------------------

### Extracting Byte Offsets and N-Best Results from SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example demonstrates how to train a SentencePiece model and then use the `encode_as_serialized_proto` and `nbest_encode_as_serialized_proto` methods to retrieve token byte offsets and N-best segmentation results, parsed with the `sentencepiece_pb2` module.

```Python
from sentencepiece import sentencepiece_pb2

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# One best result
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto('ｈｅｌｌｏ'))  # Full-width hello
# begin/end (offsets) point into the original input.
print(spt)

# Nbest results
nspt = sentencepiece_pb2.NBestSentencePieceText()
nspt.ParseFromString(sp.nbest_encode_as_serialized_proto('ｈｅｌｌｏ', 5))
# print(nspt)
```

--------------------------------

### Training and Using a Basic SentencePiece Model (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates the core end-to-end workflow of SentencePiece. It trains a new model (`m.model`) from `botchan.txt` with a vocabulary size of 2000, then loads this model into a `SentencePieceProcessor` instance. Finally, it shows how to encode text into subword pieces and their corresponding IDs, and how to decode pieces or IDs back into text.
```python
import sentencepiece as spm

# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
```

--------------------------------

### Conditional Installation of Pkg-Config File (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This block conditionally installs the generated `sentencepiece.pc` file into the `pkgconfig` directory under `CMAKE_INSTALL_LIBDIR`, but only if the compiler is not MSVC. This ensures proper pkg-config integration on non-Windows systems.

```CMake
if (NOT MSVC)
  # suppress warning for C++11 features.
  # add_definitions("-Wno-deprecated-declarations -Wno-deprecated-enum-enum-conversion")
  install(FILES "${CMAKE_CURRENT_BINARY_DIR}/sentencepiece.pc"
          DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)
endif()
```

--------------------------------

### Querying SentencePiece Model Vocabulary (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates various methods for querying information from a loaded SentencePiece model. It shows how to retrieve the vocabulary size, convert between piece IDs and their string representations, handle unknown tokens, and identify the default control symbols `<unk>`, `<s>`, and `</s>`.
```python
# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))

# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# and they are defined as 'control' symbols.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
```

--------------------------------

### Training SentencePiece Model in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to train a new SentencePiece model using the `SentencePieceTrainer::Train` function. Training parameters are passed as a single string, in the same format as the command-line utility `spm_train`.

```C++
#include <sentencepiece_trainer.h>

sentencepiece::SentencePieceTrainer::Train("--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
```

--------------------------------

### Training SentencePiece Model with Input Sentence Size Limit (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a SentencePiece model on a subset of the training data by specifying `input_sentence_size`. It also shows how to load the trained model and encode text into pieces.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
```

--------------------------------

### Training and Using SentencePiece Unigram Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates the process of training a Unigram model with SentencePiece, loading the resulting model, and performing subword segmentation.
It highlights the use of `encode_as_pieces` for standard segmentation and `nbest_encode_as_pieces` to retrieve multiple segmentation options, a feature supported by Unigram models. The model is trained on `botchan.txt` with a vocabulary size of 2000.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
```

--------------------------------

### Configuring RPATH for macOS Builds

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet configures Run-Time Search Path (RPATH) settings specifically for macOS builds. It enables RPATH, controls its behavior during build and install, and sets the install RPATH to the project's library directory, ensuring that shared libraries can be found at runtime after installation.

```CMake
if (APPLE)
  set(CMAKE_MACOSX_RPATH ON)
  set(CMAKE_SKIP_BUILD_RPATH FALSE)
  set(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
  set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib")
  set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
  list(FIND CMAKE_PLATFORM_IMPLICIT_LINK_DIRECTORIES "${CMAKE_INSTALL_PREFIX}/lib" isSystemDir)
  if ("${isSystemDir}" STREQUAL "-1")
    set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib")
  endif()
endif()
```

--------------------------------

### Training and Using SentencePiece Word Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates training a word-level segmentation model with SentencePiece, where input text is expected to be pre-tokenized and segmentation occurs primarily on whitespace. It demonstrates `encode_as_pieces` for word segmentation and `encode_as_ids` for retrieving the corresponding vocabulary IDs.
The model is trained on `botchan.txt` with a vocabulary size of 2000.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.'))  # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
```

--------------------------------

### Applying Subword Regularization with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/README.md

This Python example illustrates subword regularization with the SentencePiece library. It initializes a `SentencePieceProcessor` and repeatedly calls `encode` with `enable_sampling=True`, `alpha=0.1`, and `nbest_size=-1`. The same input 'New York' can yield different segmentations on each call thanks to on-the-fly subword sampling, which enhances model robustness.

```Python
import sentencepiece as spm

s = spm.SentencePieceProcessor(model_file='spm.model')
for n in range(5):
    print(s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```

--------------------------------

### Restricting SentencePiece Vocabulary and Resetting (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example shows how to train a SentencePiece model, then restrict its vocabulary to only the tokens appearing more than a specified number of times in the training data. It also demonstrates how to reset the vocabulary restriction.

```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# Aggregates the frequency of each token in the training data.
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in sp.encode_as_pieces(line):
            freq.setdefault(piece, 0)
            freq[piece] += 1

# only uses the tokens appearing more than 1000 times in the training data.
vocabs = list(filter(lambda x: x in freq and freq[x] > 1000, vocabs))
sp.set_vocabulary(vocabs)
print(sp.encode_as_pieces('this is a test.'))

# reset the restriction
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))
```

--------------------------------

### End-to-End SentencePiece Workflow (Shell)

Source: https://github.com/google/sentencepiece/blob/master/README.md

This snippet demonstrates the complete SentencePiece workflow. It starts by training a Unigram model using `spm_train`, then encodes a sample sentence into subword pieces and their corresponding IDs using `spm_encode`, and finally decodes the IDs back to the original sentence using `spm_decode`. This illustrates the round-trip capability of SentencePiece.

```Shell
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with : input: "../data/botchan.txt"
...
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```

--------------------------------

### Tokenizing Text to String Pieces in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example illustrates how to use the `SentencePieceProcessor::Encode` method to tokenize an input string into a vector of `std::string` pieces. It then iterates through and prints each generated token.

```C++
std::vector<std::string> pieces;
processor.Encode("This is a test.", &pieces);
for (const std::string &token : pieces) {
  std::cout << token << std::endl;
}
```

--------------------------------

### Training with Control Symbols (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows how to train a SentencePiece model with "control symbols" such as `<sep>` and `<cls>`. Unlike user-defined symbols, control symbols only reserve IDs and are not treated as single tokens if they appear in the input text. They are typically inserted explicitly by the user after encoding in production settings, to prevent unintended behavior from user input. The example demonstrates training and loading such a model.

```python
# Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
```

--------------------------------

### Defining BOS/EOS as User-Defined Symbols in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to treat the beginning-of-sentence (`<s>`) and end-of-sentence (`</s>`) tokens as user-defined symbols rather than default control symbols in SentencePiece. It trains two models: one with the default behavior where `<s>`/`</s>` in the input are segmented, and another where they are explicitly defined as user symbols, causing them to be treated as single tokens.
```python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>'))  # <s>, </s> are segmented. (default behavior)

sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>'))  # <s>, </s> are handled as one token.
```

--------------------------------

### Customizing Special Symbol IDs and Surface Forms in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example illustrates how to customize the vocabulary IDs and surface representations of the special symbols (PAD, UNK, BOS, EOS) during SentencePiece model training using dedicated flags. It then loads the trained model and iterates through the first few IDs to print their corresponding pieces and check whether they are recognized as control symbols.

```python
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
```

--------------------------------

### Training SentencePiece Model to Allow Crossing-Word Pieces (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates training a SentencePiece model with `split_by_whitespace=false` to allow pieces to cross word boundaries. It then shows how to load the model and identify such crossing-word pieces using regular expressions.

```Python
import re

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

for piece in vocabs[0:500]:
    if re.match(r'\w+▁\w+', piece):
        print(piece)
```

--------------------------------

### Training with User-Defined Symbols (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a SentencePiece model with custom "user-defined symbols" like `<sep>` and `<cls>`. These symbols are treated as single tokens during encoding and can appear directly in the input text, making them useful for experimental purposes or specific NLP tasks like BERT-style tokenization. The example shows how to train, load, encode text containing these symbols, and query their IDs.

```python
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>
```

--------------------------------

### Configuring Abseil Library Integration (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This conditional block manages how the Abseil library is integrated into the project based on the `SPM_ABSL_PROVIDER` variable. It supports three modes: "internal" (uses a local copy), "module" (fetches via `FetchContent` and adds as a subdirectory), and "package" (finds an installed package). It also handles symlinking for consistent `absl` directory access.
```CMake
if (SPM_ABSL_PROVIDER STREQUAL "internal")
  include_directories(${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
elseif (SPM_ABSL_PROVIDER STREQUAL "module")
  include(FetchContent)
  FetchContent_Populate(abseil-cpp
    GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/third_party/abseil-cpp
    GIT_PROGRESS TRUE)
  add_subdirectory(third_party/abseil-cpp)
  if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    file(RENAME ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl
         ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/abseil-cpp/absl
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
  endif()
elseif (SPM_ABSL_PROVIDER STREQUAL "package")
  find_package(absl REQUIRED)
  get_target_property(ABSL_INCLUDE_DIRS absl::base INTERFACE_INCLUDE_DIRECTORIES)
  if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    file(RENAME ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl
         ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl.org)
    execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink
                    ${ABSL_INCLUDE_DIRS}/absl
                    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/absl)
  endif()
  include_directories(${ABSL_INCLUDE_DIRS})
endif()
```

--------------------------------

### Training and Using SentencePiece BPE Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to train a Byte Pair Encoding (BPE) model using SentencePiece, load the trained model, and then use it for subword segmentation. It shows `encode_as_pieces` for standard segmentation and `nbest_encode_as_pieces`, which returns an empty list for BPE models because they do not support n-best segmentation. The model is trained on `botchan.txt` with a vocabulary size of 2000.
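For intuition about what `--model_type=bpe` trains, BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols in the corpus. The sketch below is a simplified, toy illustration of that merge loop (the `bpe_merges` helper and the word list are inventions for this example, not SentencePiece's implementation, which also handles the `▁` whitespace marker and piece scores):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]  # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_merges(['low', 'lower', 'lowest'] * 3, 3))
```

Each learned merge becomes a vocabulary piece; encoding later replays the merges on new text in the same order.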
```Python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5))  # returns an empty list.
```

--------------------------------

### Configuring CPack for Package Generation (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Configures CPack, CMake's packaging tool, to define how installation packages are generated. It sets the source and binary package generators, package version, license and readme files, contact information, and files to ignore during source packaging. Finally, `include(CPack)` enables the CPack module.

```CMake
set(CPACK_SOURCE_GENERATOR "TXZ")
set(CPACK_GENERATOR "7Z")
set(CPACK_PACKAGE_VERSION "${SPM_VERSION}")
set(CPACK_STRIP_FILES TRUE)
set(CPACK_RESOURCE_FILE_LICENSE "${PROJECT_SOURCE_DIR}/LICENSE")
set(CPACK_RESOURCE_FILE_README "${PROJECT_SOURCE_DIR}/README.md")
set(CPACK_PACKAGE_CONTACT "taku@google.com")
set(CPACK_DEBIAN_PACKAGE_MAINTAINER "Taku Kudo")
set(CPACK_SOURCE_IGNORE_FILES "/build/;/.git/;/dist/;/sdist/;~$;${CPACK_SOURCE_IGNORE_FILES}")
include(CPack)
```

--------------------------------

### Using ImmutableSentencePieceText for Decoding in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example illustrates how to use `ImmutableSentencePieceText` for decoding a sequence of IDs. It shows how to populate the proto with decoded data and then access the reconstructed text along with detailed information for each piece, including byte offsets and IDs.

```C++
processor.Decode({10, 20, 30}, spt.mutable_proto());
std::cout << spt.text() << std::endl;  // This is the same as the decoded string.

for (const auto &piece : spt.pieces()) {
  // the same as above.
}
```

--------------------------------

### Encoding Text with BOS/EOS Markers and Reversal

Source: https://github.com/google/sentencepiece/blob/master/README.md

These examples show how to use the `--extra_options` flag with `spm_encode` to add beginning-of-sentence (BOS) and/or end-of-sentence (EOS) markers to the encoded output, or to reverse the input sequence before encoding.

```Shell
spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```

--------------------------------

### Loading SentencePiece Model from Byte Stream (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet illustrates how to load a SentencePiece model directly from a serialized protocol buffer byte stream, rather than from a file path. It uses TensorFlow's `tf.io.gfile.GFile` to read the model file into memory, which is useful when models are stored in non-Posix file systems or need to be loaded dynamically. After loading, it demonstrates encoding text.

```python
import tensorflow as tf
import sentencepiece as spm

# Assumes that m.model is stored in a non-Posix file system.
serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()

sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))
```

--------------------------------

### Performing Sampled Segmentation for Subword Regularization in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to use sampled segmentation with a SentencePiece Unigram model for subword regularization, a technique for data augmentation.
It shows how to obtain different segmentations for the same input text by repeatedly calling `sample_encode_as_pieces` and `sample_encode_as_ids` with the `nbest_size` and `inverse_temperature` parameters.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature).
# See the paper [kudo18] for details.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))

for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
```

--------------------------------

### Setting Extra Decoding Options in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This snippet demonstrates how to use `SetDecodeExtraOptions` to apply modifications during the decoding process. It shows an example of reversing the decoder's output. This method should be invoked immediately after the model has been loaded.

```C++
processor.SetDecodeExtraOptions("reverse");  // the decoder's output is reversed.
```

--------------------------------

### Training SentencePiece with Pre-defined Normalization Rule in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet shows how to train a SentencePiece model using a pre-defined normalization rule, specifically `nfkc_cf` for NFKC normalization and Unicode case folding (lower casing). After training, the model is loaded and used to demonstrate how input text is automatically normalized during segmentation, converting full-width and uppercase characters to their normalized, lowercased forms.

```Python
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.'))  # lower casing and normalization
```

--------------------------------

### Setting Extra Encoding Options in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to use `SetEncodeExtraOptions` to modify the encoding behavior. It demonstrates adding beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens, and also reversing the input before adding these tokens. This method must be called after the model is loaded.

```C++
processor.SetEncodeExtraOptions("bos:eos");          // add <s> and </s>.
processor.SetEncodeExtraOptions("reverse:bos:eos");  // reverse the input and then add <s> and </s>.
```

--------------------------------

### Customizing UNK Symbol Surface Representation in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This example shows how to change the surface representation of the Unknown (UNK) symbol when decoding IDs in SentencePiece. It first demonstrates the default UNK surface (U+2047) and then trains a new model with the `--unk_surface` flag to set a custom string, `__UNKNOWN__`, for the UNK symbol's decoded form.
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))  # default is U+2047

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --unk_surface=__UNKNOWN__')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))
```

--------------------------------

### Getting Vocabulary Size using len() with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Demonstrates using the built-in `len()` function on a SentencePieceProcessor instance to retrieve the vocabulary size, providing a convenient alternative to `get_piece_size()`.

```Python
len(sp)
```

--------------------------------

### Encoding and Decoding Control Symbols with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how SentencePiece handles pre-defined control symbols like `<sep>` and `<cls>`. It shows encoding text containing these symbols into pieces, retrieving their corresponding IDs, and decoding these IDs back to their surface forms, which results in an empty string for control symbols.
```python
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))

print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4

print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty
```

--------------------------------

### Accessing and Using Special Symbol IDs in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to retrieve the default IDs for Beginning-of-Sentence (BOS), End-of-Sentence (EOS), Unknown (UNK), and Padding (PAD) symbols using the `bos_id()`, `eos_id()`, `unk_id()`, and `pad_id()` methods in SentencePiece. It also shows how to manually prepend and append BOS/EOS IDs to an encoded sequence of text.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default

print(sp.encode_as_ids('Hello world'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
```

--------------------------------

### Getting Vocabulary Size with SentencePiece in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Retrieves the total number of unique pieces (vocabulary size) in the loaded SentencePiece model using `sp.get_piece_size()`. This indicates the size of the model's vocabulary.

```Python
sp.get_piece_size()
```

--------------------------------

### Detokenizing Text from Pieces or IDs in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This example shows how to use `SentencePieceProcessor::Decode` to reconstruct text from either a sequence of string pieces or a sequence of integer IDs.
It highlights that detokenization is designed to be the inverse operation of encoding, ensuring consistency.

```C++
// Sequence of pieces.
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." };
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;

// Sequence of ids.
std::vector<int> ids = { 451, 26, 20, 3, 158, 128, 12 };
processor.Decode(ids, &text);
std::cout << text << std::endl;
```

--------------------------------

### Training SentencePiece with Custom Normalization Rules from TSV in Python

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to define and apply custom text normalization rules in SentencePiece using a TSV file. It includes a helper function `tocode` to convert strings to Unicode code points for the TSV format, creates a `normalization_rule.tsv` file with "I'm" to "I am" and "don't" to "do not" mappings, and then trains a SentencePiece model using this custom rule. Finally, it shows how the loaded model automatically applies these rules during segmentation.

```Python
import sentencepiece as spm

def tocode(s):
    out = []
    for c in s:
        out.append(str(hex(ord(c))).replace('0x', 'U+'))
    return ' '.join(out)

# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not", "I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')

sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy"))          # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it."))  # normalized to 'I do not know it.'
```

--------------------------------

### Initializing SentencePieceProcessor in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Demonstrates how to import the SentencePiece library and initialize a SentencePieceProcessor instance by loading a pre-trained model file. This is the first step before performing any segmentation or tokenization operations.

```Python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
```

--------------------------------

### Disabling BOS/EOS Symbols in SentencePiece (Python)

Source: https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

This snippet demonstrates how to disable the Beginning-of-Sentence (BOS) and End-of-Sentence (EOS) symbols in a SentencePiece model by setting their IDs to -1 during training. When disabled, these symbols are no longer treated as special tokens and are instead mapped to the Unknown (UNK) symbol ID if encountered.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
```

--------------------------------

### Initializing CMake Project and Setting Version

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

This snippet sets the minimum required CMake version, reads the project version from VERSION.txt, displays it, and then defines the SentencePiece project with C and C++ language support. It also handles CMake policy CMP0091 for modern behavior.
```CMake
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)

file(STRINGS "VERSION.txt" SPM_VERSION)
message(STATUS "VERSION: ${SPM_VERSION}")

if(POLICY CMP0091)
  cmake_policy(SET CMP0091 NEW)
endif()

project(sentencepiece VERSION ${SPM_VERSION} LANGUAGES C CXX)
```

--------------------------------

### Loading SentencePiece Model in C++

Source: https://github.com/google/sentencepiece/blob/master/doc/api.md

This snippet demonstrates how to initialize the SentencePieceProcessor and load a pre-trained model. It shows loading from a file path and includes basic error handling. It also comments on the alternative method of loading from a serialized proto string.

```C++
#include <sentencepiece_processor.h>

sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("//path/to/model.model");
if (!status.ok()) {
  std::cerr << status.ToString() << std::endl;
  // error
}

// You can also load a serialized model from std::string.
// const std::string str = // Load blob contents from a file.
// auto status = processor.LoadFromSerializedProto(str);
```

--------------------------------

### Training a SentencePiece Model in Python

Source: https://github.com/google/sentencepiece/blob/master/python/README.md

Illustrates how to train a new SentencePiece model using `SentencePieceTrainer.train()`. It takes an input text file, a model prefix for output files, and parameters like vocabulary size and user-defined symbols.

```Python
import sentencepiece as spm

spm.SentencePieceTrainer.train(input='test/botchan.txt', model_prefix='m', vocab_size=1000, user_defined_symbols=['foo', 'bar'])
```

--------------------------------

### Verifying SentencePiece Release Binary with SLSA

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command uses the `slsa-verifier` tool to verify the authenticity and integrity of a SentencePiece release binary. It requires the artifact path, the downloaded provenance file (`attestation.intoto.jsonl`), the source repository, and the release tag.
```Shell
slsa-verifier -artifact-path <the-artifact> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```

--------------------------------

### Configuring Project Files from Templates (CMake)

Source: https://github.com/google/sentencepiece/blob/master/CMakeLists.txt

Uses `configure_file` to generate `config.h` from `config.h.in` and `sentencepiece.pc` from `sentencepiece.pc.in`. The `@ONLY` option for `sentencepiece.pc` ensures that only CMake variables are substituted, not shell variables.

```CMake
configure_file("${PROJECT_SOURCE_DIR}/config.h.in" "config.h")
configure_file("${PROJECT_SOURCE_DIR}/sentencepiece.pc.in" "sentencepiece.pc" @ONLY)
```

--------------------------------

### Training a SentencePiece Model

Source: https://github.com/google/sentencepiece/blob/master/README.md

This command trains a new SentencePiece model from a raw corpus file. Key parameters include the input file, output model prefix, desired vocabulary size, character coverage, and the model type (unigram, BPE, char, or word).

```Shell
spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```