### Building Libtorchtext and Examples

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/README.md

This bash script provides the commands to build `libtorchtext` and its associated example applications. It first grants execute permission to `build.sh` and then runs the script to initiate the build process.

```bash
chmod +x build.sh # give script execute permission
./build.sh
```

--------------------------------

### Building torchtext from Source on Linux

Source: https://github.com/pytorch/text/blob/main/README.rst

This command compiles and installs torchtext from its source code on Linux systems. It cleans previous builds and performs a fresh installation.

```Shell
python setup.py clean install
```

--------------------------------

### Installing torchtext using Pip

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the torchtext library using the pip package installer. This is a standard Python package installation method.

```Shell
pip install torchtext
```

--------------------------------

### Installing SacreMoses Tokenizer

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the SacreMoses library, which provides a port of the Moses tokenizer. It is an optional dependency for using the Moses tokenizer with torchtext.

```Shell
pip install sacremoses
```

--------------------------------

### Running Basic Text Classification Training Script (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/text_classification/README.md

This snippet runs a shell script that trains the basic text classification model. It serves as a quick start for a predefined training pipeline, which likely uses the AG_NEWS dataset mentioned in the source README.

```bash
./run_script.sh
```

--------------------------------

### Configuring libtorchtext C++ Example with CMake

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/CMakeLists.txt

This snippet defines the minimum CMake version, sets the project name, and configures build options for a C++ example using libtorchtext. It finds the required Torch package, applies its C++ flags, and includes libtorchtext and tokenizer subdirectories.

```CMake
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(libtorchtext_cpp_example)

SET(BUILD_TORCHTEXT_PYTHON_EXTENSION OFF CACHE BOOL "Build Python binding")

find_package(Torch REQUIRED)
message("libtorchtext CMakeLists: ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_subdirectory(../.. libtorchtext)
add_subdirectory(tokenizer)
```

--------------------------------

### Preparing Example Input and Loading a JIT-Compiled T5 Model

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet defines a list of example input strings formatted for T5 tasks (question answering and summarization). It then loads a JIT-compiled T5-Large generation model using the `get_jit_from_bundle` utility function, preparing it for inference.

```Python
EXAMPLE_INPUT = [
    'question: What does Nir like to eat? context: Nir is a PM on the Care AI team. Nir only eats vegetarian food and he loves Pizza',
    'question: Who likes to eat pizza? context: Nir is a PM on the Care AI team. Nir only eats vegetarian food and he loves Pizza',
    "summarize: studies say that owning a dog is good for you",
]

t5_large = get_jit_from_bundle(T5_LARGE_GENERATION)
```

--------------------------------

### Building torchtext from Source on OSX

Source: https://github.com/pytorch/text/blob/main/README.rst

This command compiles and installs torchtext from its source code on OSX systems, explicitly using clang as the C++ compiler. It cleans previous builds and performs a fresh installation.

```Shell
CC=clang CXX=clang++ python setup.py clean install
```

--------------------------------

### Installing SpaCy and English Model

Source: https://github.com/pytorch/text/blob/main/README.rst

These commands install the SpaCy library and download its small English language model, which is required if you intend to use SpaCy's English tokenizer with torchtext.

```Shell
pip install spacy
python -m spacy download en_core_web_sm
```

--------------------------------

### Cloning and Initializing torchtext Source Repository

Source: https://github.com/pytorch/text/blob/main/README.rst

These commands clone the torchtext repository from GitHub and initialize its submodules, which are necessary for building the library from source.

```Shell
git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive
```

--------------------------------

### Training Text Classification Model with SentencePiece and YelpReviewFull (Python)

Source: https://github.com/pytorch/text/blob/main/examples/text_classification/README.md

This command initiates the training of a text classification model using Python. It specifies the 'YelpReviewFull' dataset, utilizes a CUDA-enabled device, enables SentencePiece tokenization, sets the number of training epochs to 10, and configures the embedding dimension to 64. This setup aims to reproduce fastText results.

```bash
python train.py YelpReviewFull --device cuda --use-sp-tokenizer True --num-epochs 10 --embed-dim 64
```

--------------------------------

### Installing SentencePiece for older torchtext versions

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the SentencePiece library using Conda. It is specifically required for torchtext versions 0.5 and below for subword tokenization.

```Shell
conda install -c powerai sentencepiece
```

--------------------------------

### Developing torchtext from Source

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs torchtext in 'develop' mode, which links the installed package to the source directory. This is useful for developers making modifications to the library without needing to reinstall after every change.

```Shell
python setup.py develop
```

--------------------------------

### Installing torchtext using Conda

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the torchtext library using the Conda package manager from the PyTorch channel. It is a recommended method for managing Python packages and their dependencies.

```Shell
conda install -c pytorch torchtext
```

--------------------------------

### Conditional Python Extension Build Setup (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet begins the conditional setup for the `_torchtext.so` Python extension, which is enabled by `BUILD_TORCHTEXT_PYTHON_EXTENSION`. It locates the `torch_python` library and, on Windows, also finds the Python 3 Development component so the extension can link against `Python3::Python`. The excerpt stops before the closing `endif()` of the outer `if` block.

```CMake
if (BUILD_TORCHTEXT_PYTHON_EXTENSION)
  # See https://github.com/pytorch/pytorch/issues/38122
  find_library(TORCH_PYTHON_LIBRARY torch_python PATHS "${TORCH_INSTALL_PREFIX}/lib")
  if (WIN32)
    find_package(Python3 ${PYTHON_VERSION} EXACT COMPONENTS Development)
    set(ADDITIONAL_ITEMS Python3::Python)
  endif()
```

--------------------------------

### Setting CMake Module Path and Torch Prefixes

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Extends the CMake module search path, defines `TORCH_INSTALL_PREFIX` for locating PyTorch installations, and sets `TORCH_COMPILED_WITH_CXX_ABI` to ensure ABI compatibility with PyTorch.

```CMake
set(CMAKE_MODULE_PATH "${CMAKE_MODULE_PATH};${CMAKE_CURRENT_SOURCE_DIR}/cmake")
set(TORCH_INSTALL_PREFIX "${CMAKE_PREFIX_PATH}/../.." CACHE STRING "Install path for torch")
set(TORCH_COMPILED_WITH_CXX_ABI "-D_GLIBCXX_USE_CXX11_ABI=0" CACHE STRING "Compile torchtext with cxx11_abi")
```
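
To see why `TORCH_INSTALL_PREFIX` is derived with `../..`, it can help to resolve the path by hand. The sketch below assumes `CMAKE_PREFIX_PATH` points at the `share/cmake` directory inside a PyTorch installation; the concrete path and the helper name `torch_install_prefix` are hypothetical and serve only as an illustration.

```python
import posixpath


def torch_install_prefix(cmake_prefix_path: str) -> str:
    """Mimic "${CMAKE_PREFIX_PATH}/../.." by walking two directories up."""
    return posixpath.normpath(posixpath.join(cmake_prefix_path, "..", ".."))


# Hypothetical location of PyTorch's CMake config files.
example_prefix = "/opt/venv/lib/python3.10/site-packages/torch/share/cmake"
install_prefix = torch_install_prefix(example_prefix)

print(install_prefix)
# Headers and libraries are then looked up relative to that directory.
print(posixpath.join(install_prefix, "include"))
print(posixpath.join(install_prefix, "lib"))
```

Under this assumption the prefix resolves to the `torch` package directory, which is where the `include` and `lib` paths used elsewhere in the build are rooted.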

--------------------------------

### Installing pre-commit for Python Code Formatting (conda)

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command installs the `pre-commit` tool using conda from the `conda-forge` channel. It serves the same purpose as the pip installation, providing the necessary tool for `torchtext`'s code style enforcement and pre-commit hooks.

```shell
conda install -c conda-forge pre-commit
```

--------------------------------

### Installing TorchArrow with PyTorch Dependency (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/torcharrow/README.md

This command installs TorchArrow from source, ensuring that the `USE_TORCH=1` flag is set. This flag is crucial for enabling natively integrated text operators like `bpe_tokenize` and `lookup_indices` which depend on the PyTorch library, as TorchArrow does not include this dependency by default.

```Bash
USE_TORCH=1 python setup.py install
```

--------------------------------

### Defining libtorchtext Include Directories (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet specifies the necessary include directories for compiling the `libtorchtext` library. It includes paths to the project's source, third-party dependencies like SentencePiece, re2, double-conversion, utf8proc, and PyTorch installation directories for API headers.

```CMake
set(
  LIBTORCHTEXT_INCLUDE_DIRS
  ${PROJECT_SOURCE_DIR}
  ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
  $<TARGET_PROPERTY:re2,INCLUDE_DIRECTORIES>
  $<TARGET_PROPERTY:double-conversion,INCLUDE_DIRECTORIES>
  $<TARGET_PROPERTY:utf8proc,INCLUDE_DIRECTORIES>
  ${TORCH_INSTALL_PREFIX}/include
  ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
  )
```

--------------------------------

### Loading and Tokenizing IMDB Dataset with torchtext.datasets (Python)

Source: https://github.com/pytorch/text/blob/main/docs/source/datasets.rst

This example illustrates how to load the IMDB dataset using `torchtext.datasets.IMDB` and iterate through its elements. It shows a basic tokenization function applied to each line, demonstrating how to process the `(label, line)` pairs yielded by the dataset iterator. The `split='train'` argument specifies the dataset partition to load.

```Python
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
```
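
A common next step after collecting tokens is to build a vocabulary. The following is a minimal sketch using only the standard library; the two in-memory samples are a hypothetical stand-in for the `(label, line)` pairs yielded by the dataset iterator, and the frequency-based index assignment is one possible choice, not the method torchtext itself uses.

```python
from collections import Counter

# Hypothetical stand-in for the (label, line) pairs a dataset iterator yields.
samples = [
    (1, "a slow and dull film"),
    (2, "a bright and funny film"),
]


def tokenize(line):
    """Whitespace tokenization, matching the example above."""
    return line.split()


# Count token frequencies across all lines.
counter = Counter()
for _label, line in samples:
    counter.update(tokenize(line))

# Assign indices by descending frequency, ties broken alphabetically.
ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
vocab = {token: index for index, (token, _count) in enumerate(ordered)}

print(vocab)
```

Using `Counter.update` avoids materializing one large token list, which matters for datasets that are streamed rather than held in memory.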

--------------------------------

### Installing pre-commit for Python Code Formatting (pip)

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command installs the `pre-commit` tool using pip, which is used to enforce code style for Python, text, and configuration files in `torchtext`. It's a prerequisite for automatically checking and fixing code format before committing changes.

```shell
pip install pre-commit
```

--------------------------------

### Defining Python Project Dependencies

Source: https://github.com/pytorch/text/blob/main/docs/requirements.txt

This requirements file lists the Python packages needed to build the documentation. Only `sphinx` and `sphinx_gallery` are pinned to exact versions; `Jinja2` has an upper bound (`<3.1.0`), `matplotlib` and `regex` are unpinned, and the PyTorch Sphinx theme is installed in editable mode from a specific commit of its Git repository.

```text
Jinja2<3.1.0
sphinx==5.1.1
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@cece053#egg=pytorch_sphinx_theme
sphinx_gallery==0.11.1
matplotlib
regex
```
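
The entries in this file constrain versions in different ways. As a rough illustration, the sketch below labels simple requirement lines as pinned, bounded, or unpinned. The `classify` helper is hypothetical, handles only plain `name<operator>version` lines, and deliberately skips the editable `-e git+...` entry.

```python
import re

REQUIREMENTS = [
    "Jinja2<3.1.0",
    "sphinx==5.1.1",
    "sphinx_gallery==0.11.1",
    "matplotlib",
    "regex",
]


def classify(requirement: str) -> str:
    """Label a simple requirement line as pinned, bounded, or unpinned."""
    match = re.match(r"^[A-Za-z0-9_.\-]+\s*(==|<=|>=|<|>|~=|!=)?", requirement)
    operator = match.group(1) if match else None
    if operator is None:
        return "unpinned"
    return "pinned" if operator == "==" else "bounded"


for line in REQUIREMENTS:
    print(f"{line:<24} {classify(line)}")
```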

--------------------------------

### Defining Python Extension Build Function (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This CMake function `define_extension` defines how to build a Python shared extension module. It configures the library with specified sources, include directories (including Python's), link libraries (including `torch_python`), and compile definitions. It also handles platform-specific properties like `.pyd` suffix for MSVC and `LINK_FLAGS` for Apple, and sets installation rules.

```CMake
function(define_extension name sources include_dirs link_libraries definitions)
    add_library(${name} SHARED ${sources})
    target_compile_definitions(${name} PRIVATE "${definitions}")
    target_include_directories(
      ${name} PRIVATE ${Python_INCLUDE_DIR} ${include_dirs})
    target_link_libraries(
      ${name}
      ${link_libraries}
      ${TORCH_PYTHON_LIBRARY}
      ${ADDITIONAL_ITEMS}
      )
    set_target_properties(${name} PROPERTIES PREFIX "")
    if (MSVC)
      set_target_properties(${name} PROPERTIES SUFFIX ".pyd")
    endif(MSVC)
    if (APPLE)
      # https://github.com/facebookarchive/caffe2/issues/854#issuecomment-364538485
      # https://github.com/pytorch/pytorch/commit/73f6715f4725a0723d8171d3131e09ac7abf0666
      set_target_properties(${name} PROPERTIES LINK_FLAGS "-undefined dynamic_lookup")
    endif()
    install(
      TARGETS ${name}
      LIBRARY DESTINATION .
      RUNTIME DESTINATION .  # For Windows
      )
  endfunction()
```
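
The empty `PREFIX` and the `.pyd` suffix exist because Python looks for an extension module under the module's own name plus a platform-specific suffix. The sketch below lists the suffixes the running interpreter accepts; the helper `importable_as_extension` is a hypothetical name used only for illustration.

```python
import importlib.machinery

# File suffixes the running interpreter accepts for C extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)


def importable_as_extension(filename: str) -> bool:
    """Return True if the file name ends with a recognised extension suffix."""
    return any(filename.endswith(suffix) for suffix in suffixes)


# Which of the two names is accepted depends on the platform.
print(importable_as_extension("_torchtext.so"))
print(importable_as_extension("_torchtext.pyd"))
```

A file named `lib_torchtext.so` would not match the module name `_torchtext`, which is why the default `lib` prefix is removed.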

--------------------------------

### Defining Shared Library Build Function (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This CMake function `define_library` encapsulates the logic for building a shared library. It takes the library name, source files, include directories, link libraries, and compile definitions as arguments, then configures the library properties, including setting a `.pyd` suffix for MSVC builds and defining installation rules.

```CMake
function (define_library name source include_dirs link_libraries compile_defs)
  add_library(${name} SHARED ${source})
  target_include_directories(${name} PRIVATE ${include_dirs})
  target_link_libraries(${name} ${link_libraries})
  target_compile_definitions(${name} PRIVATE ${compile_defs})
  set_target_properties(${name} PROPERTIES PREFIX "")
  if (MSVC)
    set_target_properties(${name} PROPERTIES SUFFIX ".pyd")
  endif(MSVC)
  install(
    TARGETS ${name}
    LIBRARY DESTINATION lib
    RUNTIME DESTINATION lib  # For Windows
    )
endfunction()
```

--------------------------------

### Initializing T5 Models and Data - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet initializes input and output sentences for translation, then prepares both TorchText and Hugging Face T5 models. It obtains the `transform` function and the TorchText T5 model (`tt_t5_model`) from `T5_BASE`, and loads the Hugging Face T5 base model (`hf_t5_model`) using `T5Model.from_pretrained`.

```Python
input_sentence = ["translate to Spanish: My name is Joe"]
output_sentence = ["Me llamo Joe"]

transform = T5_BASE.transform()
tt_t5_model = T5_BASE.get_model()

hf_t5_model = T5Model.from_pretrained("t5-base")
```

--------------------------------

### Utility Functions for Loading and JIT Compiling T5 Models in PyTorch

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This collection of functions provides a structured way to build, load, and prepare T5 models from bundles. It includes `build_model` for flexible model instantiation, `load_model` for loading from a bundle, `get_model_from_bundle` for creating a `TorchScriptableT5` instance, and `get_jit_from_bundle` for obtaining a JIT-compiled version, optionally with CUDA support.

```Python
from typing import Optional, Union, Dict, Any
from torchtext import _TEXT_BUCKET
from urllib.parse import urljoin
from torchtext._download_hooks import load_state_dict_from_url


def build_model(
    config: T5Conf,
    T5Class=T5Model,
    freeze_model: bool = False,
    checkpoint: Optional[Union[str, Dict[str, torch.Tensor]]] = None,
    strict: bool = False,
    dl_kwargs: Optional[Dict[str, Any]] = None,
) -> T5Model:
    """Class builder method that can override the default T5Model model class.

    (reference: https://github.com/pytorch/text/blob/a1dc61b8e80df70fe7a35b9f5f5cc7e19c7dd8a3/torchtext/models/t5/bundler.py#L113)

    Args:
        config (T5Conf): An instance of class T5Conf that defines the model configuration
        freeze_model (bool): Indicates whether to freeze the model weights. (Default: `False`)
        checkpoint (str or Dict[str, torch.Tensor]): Path to or actual model state_dict. state_dict can have partial weights, i.e. only for the encoder. (Default: ``None``)
        strict (bool): Passed to :func:`torch.nn.Module.load_state_dict` method. (Default: `False`)
        dl_kwargs (dictionary of keyword arguments): Passed to :func:`torch.hub.load_state_dict_from_url`. (Default: `None`)
    """
    model = T5Class(config, freeze_model)
    if checkpoint is not None:
        if torch.jit.isinstance(checkpoint, Dict[str, torch.Tensor]):
            state_dict = checkpoint
        elif isinstance(checkpoint, str):
            dl_kwargs = {} if dl_kwargs is None else dl_kwargs
            state_dict = load_state_dict_from_url(checkpoint, **dl_kwargs)
        else:
            raise TypeError(
                "checkpoint must be of type `str` or `Dict[str, torch.Tensor]` but got {}".format(type(checkpoint))
            )

        model.load_state_dict(state_dict, strict=strict)

    return model


def load_model(bundle, T5Class=T5TorchGenerative):
    """
    
    Example usage:
    >> model = load_model(bundle=T5_SMALL_GENERATION, T5Class=T5TorchGenerative)
    """
    return build_model(config=bundle.config, T5Class=T5Class, checkpoint=bundle._path)


def get_model_from_bundle(bundle, cuda=False):
    model = load_model(bundle=bundle, T5Class=T5TorchGenerative)
    tokenizer = bundle.transform()
    full_model = TorchScriptableT5(model=model, transform=tokenizer, cuda=cuda)
    return full_model

def get_jit_from_bundle(bundle, cuda=False):
    full_model = get_model_from_bundle(bundle, cuda=cuda)
    full_model_jit = torch.jit.script(full_model)
    return full_model_jit
```
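
The checkpoint handling in `build_model` is a type dispatch: an in-memory state dict is used directly, a string is treated as a location to load from, and anything else is rejected. The sketch below isolates that pattern without PyTorch. Plain dictionaries stand in for `Dict[str, torch.Tensor]`, and `resolve_state_dict` and `fake_loader` are hypothetical names that do not exist in torchtext.

```python
from typing import Any, Callable, Dict, Optional, Union

StateDict = Dict[str, Any]  # stand-in for Dict[str, torch.Tensor]


def resolve_state_dict(
    checkpoint: Optional[Union[str, StateDict]],
    loader: Callable[[str], StateDict],
) -> Optional[StateDict]:
    """Return a state dict from an in-memory dict or a path/URL string."""
    if checkpoint is None:
        return None  # nothing to load; the model keeps its initial weights
    if isinstance(checkpoint, dict):
        return checkpoint  # already a state dict
    if isinstance(checkpoint, str):
        return loader(checkpoint)  # fetch and deserialize from the location
    raise TypeError(f"checkpoint must be str or dict, got {type(checkpoint)}")


def fake_loader(location: str) -> StateDict:
    """Hypothetical loader that pretends to download weights."""
    return {"encoder.weight": [0.0, 1.0], "source": location}


print(resolve_state_dict({"encoder.weight": [1.0]}, fake_loader))
print(resolve_state_dict("https://example.com/t5.pt", fake_loader))
```

Passing the loader as a parameter keeps the dispatch testable without network access.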

--------------------------------

### Downloading GPT2 BPE Tokenizer Artifacts (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md

This snippet downloads the necessary `gpt2_bpe_vocab.bpe` and `gpt2_bpe_encoder.json` files, which are prerequisites for constructing the `GPT2BPETokenizer` object in subsequent steps.

```bash
curl -O https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe
curl -O https://download.pytorch.org/models/text/gpt2_bpe_encoder.json
```

--------------------------------

### Running RoBERTa SST-2 Training Script (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/torcharrow/README.md

This command executes the `roberta_sst2_training_with_torcharrow.py` script, initiating the end-to-end training process for SST-2 binary classification. It configures the training with a batch size of 16, runs for 1 epoch, and sets the learning rate to 1e-5, demonstrating the usage of the TorchArrow-based pipeline.

```Bash
python roberta_sst2_training_with_torcharrow.py \
        --batch-size 16 \
        --num-epochs 1 \
        --learning-rate 1e-5
```
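
For readers unfamiliar with how such flags reach a script, the following sketch parses the same three options with `argparse`. It is a hypothetical parser written for illustration; the defaults and the structure of the real training script are not shown in the source and may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three flags in the command above."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--num-epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-5)
    return parser


args = build_parser().parse_args(
    ["--batch-size", "16", "--num-epochs", "1", "--learning-rate", "1e-5"]
)
# argparse converts dashes to underscores in attribute names.
print(args.batch_size, args.num_epochs, args.learning_rate)
```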

--------------------------------

### Dynamically Modifying Tutorial and GitHub Links (JavaScript)

Source: https://github.com/pytorch/text/blob/main/docs/source/_templates/layout.html

This JavaScript snippet runs when the document is ready. On tutorial pages it derives the notebook download path from the second `.reference.download` link and uses it to set the 'Run in Google Colab' link, and it points the 'View on GitHub' link at the tutorial's source file. It also overwrites the 'GitHub' link in the main navigation menu so that it points to the `pytorch/text` repository.

```JavaScript
var collapsedSections = [];
$(document).ready(function() {
    var downloadNote = $(".sphx-glr-download-link-note.admonition.note");
    if (downloadNote.length >= 1) {
        var tutorialUrl = $("#tutorial-type").text();
        var githubLink = "https://github.com/pytorch/text/blob/main/examples/" + tutorialUrl + ".py",
            notebookLink = $(".reference.download")[1].href,
            notebookDownloadPath = notebookLink.split('_downloads')[1],
            colabLink = "https://colab.research.google.com/github/pytorch/text/blob/gh-pages/main/_downloads" + notebookDownloadPath;
        $(".pytorch-call-to-action-links a[data-response='Run in Google Colab']").attr("href", colabLink);
        $(".pytorch-call-to-action-links a[data-response='View on Github']").attr("href", githubLink);
    }
    // Overwrite the link to GitHub project
    var overwrite = function(_) {
        if ($(this).length > 0) {
            $(this)[0].href = "https://github.com/pytorch/text"
        }
    }
    // PC
    $(".main-menu a:contains('GitHub')").each(overwrite);
    // Mobile
    $(".main-menu a:contains('Github')").each(overwrite);
});
```
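
The Colab link is built with plain string manipulation: everything after `_downloads` in the notebook URL is appended to a fixed base. The sketch below restates that step in Python for clarity. The function name `colab_link` and the example URL are hypothetical.

```python
COLAB_BASE = (
    "https://colab.research.google.com/github/pytorch/text/"
    "blob/gh-pages/main/_downloads"
)


def colab_link(notebook_href: str) -> str:
    """Keep everything after '_downloads' and append it to the Colab base URL."""
    download_path = notebook_href.split("_downloads")[1]
    return COLAB_BASE + download_path


# Hypothetical notebook download URL, for illustration only.
href = "https://pytorch.org/text/main/_downloads/abc123/tutorial.ipynb"
print(colab_link(href))
```

Like the original, this sketch raises an error if the URL does not contain `_downloads`, so a guard may be worth adding where that can happen.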

--------------------------------

### Running Unit Tests with Pytest (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet lists testing frameworks and libraries, including 'pytest' for running unit tests, 'expecttest' for snapshot testing, and 'parameterized' for creating parameterized test cases, all essential for ensuring code quality.

```text
pytest
expecttest
parameterized
```

--------------------------------

### T5 Model Construction and Loading Utilities in PyTorch

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet provides a set of utility functions for building and loading T5 models in PyTorch. The `build_model` function allows for flexible model instantiation from a configuration and supports loading partial or full model checkpoints. `load_model` simplifies the process by loading a model directly from a bundle, while `get_model_from_bundle` and `get_jit_from_bundle` further encapsulate the creation of a full, scriptable T5 model with its associated tokenizer.

```python
from typing import Optional, Union, Dict, Any
from torchtext import _TEXT_BUCKET
from urllib.parse import urljoin
from torchtext._download_hooks import load_state_dict_from_url


def build_model(
    config: T5Conf,
    T5Class=T5Model,
    freeze_model: bool = False,
    checkpoint: Optional[Union[str, Dict[str, torch.Tensor]]] = None,
    strict: bool = False,
    dl_kwargs: Optional[Dict[str, Any]] = None,
) -> T5Model:
    """Class builder method that can override the default T5Model model class.

    (reference: https://github.com/pytorch/text/blob/a1dc61b8e80df70fe7a35b9f5f5cc7e19c7dd8a3/torchtext/models/t5/bundler.py#L113)

    Args:
        config (T5Conf): An instance of class T5Conf that defines the model configuration
        freeze_model (bool): Indicates whether to freeze the model weights. (Default: `False`)
        checkpoint (str or Dict[str, torch.Tensor]): Path to or actual model state_dict. state_dict can have partial weights, i.e. only for the encoder. (Default: ``None``)
        strict (bool): Passed to :func:`torch.nn.Module.load_state_dict` method. (Default: `False`)
        dl_kwargs (dictionary of keyword arguments): Passed to :func:`torch.hub.load_state_dict_from_url`. (Default: `None`)
    """
    model = T5Class(config, freeze_model)
    if checkpoint is not None:
        if torch.jit.isinstance(checkpoint, Dict[str, torch.Tensor]):
            state_dict = checkpoint
        elif isinstance(checkpoint, str):
            dl_kwargs = {} if dl_kwargs is None else dl_kwargs
            state_dict = load_state_dict_from_url(checkpoint, **dl_kwargs)
        else:
            raise TypeError(
                "checkpoint must be of type `str` or `Dict[str, torch.Tensor]` but got {}".format(type(checkpoint))
            )

        model.load_state_dict(state_dict, strict=strict)

    return model


def load_model(bundle, T5Class=T5TorchGenerative):
    """
    
    Example usage:
    >> model = load_model(bundle=T5_SMALL_GENERATION, T5Class=T5TorchGenerative)
    """
    return build_model(config=bundle.config, T5Class=T5Class, checkpoint=bundle._path)


def get_model_from_bundle(bundle):
    model = load_model(bundle=bundle, T5Class=T5TorchGenerative)
    tokenizer = bundle.transform()
    full_model = TorchScriptableT5(model=model, transform=tokenizer)
    return full_model

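def get_jit_from_bundle(bundle):
    full_model = get_model_from_bundle(bundle)
    # Script the wrapped model and return the TorchScript module
    full_model_jit = torch.jit.script(full_model)
    return full_model_jit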
```

--------------------------------

### Importing Libraries for T5 Model Comparison - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet imports necessary libraries for comparing TorchText's T5 model with Hugging Face's T5 model. It includes `T5Model` from `transformers` for the Hugging Face implementation, `T5_BASE` from `torchtext.prototype.models` for the TorchText implementation, and `torch` for tensor operations and assertions.

```Python
from transformers import T5Model
from torchtext.prototype.models import T5_BASE

import torch
```

--------------------------------

### Importing Hugging Face and TorchText Generation Utilities (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet imports necessary classes from the `transformers` library for various models (T5, BART, GPT2) and their tokenizers, along with `GenerationUtil` from `torchtext.prototype.generate`, which is used for abstracting the generation process. It sets up the required dependencies for subsequent model operations.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer, BartForConditionalGeneration, BartTokenizer, GPT2LMHeadModel, GPT2Tokenizer
from torchtext.prototype.generate import GenerationUtil
```

--------------------------------

### JIT Compiling a PyTorch Model

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet compiles a PyTorch model to TorchScript with `torch.jit.script`, which analyzes the module's Python source rather than tracing an example run. The resulting scripted module can be serialized and executed without the Python interpreter, which makes it suitable for deployment.

```Python
def get_jit_from_bundle(bundle):
    full_model = get_model_from_bundle(bundle)
    full_model_jit = torch.jit.script(full_model)
    return full_model_jit
```

--------------------------------

### Generating Documentation with Sphinx (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet lists `Sphinx`, the documentation generator used to build the project's documentation from reStructuredText sources.

```text
Sphinx
```

--------------------------------

### Adding double-conversion Library as CMake Subdirectory

Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt

This command adds the `double-conversion` project as a subdirectory. The library converts between floating-point numbers and their string representations. With `EXCLUDE_FROM_ALL`, its targets are left out of the default `all` target and are built only when another target depends on them.

```CMake
add_subdirectory(double-conversion EXCLUDE_FROM_ALL)
```

--------------------------------

### Defining libtorchtext Source Files (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet defines the C++ source files required to build the `libtorchtext` library. These files contain the core implementations of tokenizers, common utilities, and bindings for PyTorch. It lists various `.cpp` files like `clip_tokenizer.cpp`, `gpt2_bpe_tokenizer.cpp`, and `vocab.cpp`.

```CMake
set(
  LIBTORCHTEXT_SOURCES
  clip_tokenizer.cpp
  common.cpp
  gpt2_bpe_tokenizer.cpp
  regex.cpp
  regex_tokenizer.cpp
  register_torchbindings.cpp
  sentencepiece.cpp
  vectors.cpp
  vocab.cpp
  bert_tokenizer.cpp
  )
```

--------------------------------

### Running clang-format for C++ Code Formatting

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command executes the `run-clang-format.py` script to run `clang-format` recursively over the C++ files in the `torchtext/csrc` directory. The path to the `clang-format` executable is passed through the `$CLANG_FORMAT` shell variable.

```shell
python run-clang-format.py \
    --recursive \
    --clang-format-executable=$CLANG_FORMAT \
    torchtext/csrc
```

--------------------------------

### Configuring Executable and Linking Libraries in CMake

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/CMakeLists.txt

This CMake configuration defines an executable named 'tokenize' from 'main.cpp', links it against the PyTorch and TorchText libraries, and sets the C++ standard to C++14 for compilation. These steps are essential for building C++ applications that interact with PyTorch and TorchText functionalities.

```CMake
add_executable(tokenize main.cpp)
target_link_libraries(tokenize "${TORCH_LIBRARIES}" "${TORCHTEXT_LIBRARY}")
set_property(TARGET tokenize PROPERTY CXX_STANDARD 14)
```

--------------------------------

### Creating TorchScript Tokenizer File (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md

This snippet executes a Python script to create a TorchScript object of the tokenizer, saving it to a specified file. It also verifies the tokenizer's output before and after saving/reloading, preparing it for use in a C++ application.

```bash
tokenizer_file="tokenizer.pt"
python create_tokenizer.py --tokenizer-file "${tokenizer_file}"
```

--------------------------------

### Performing Text Generation with GPT2 Model (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet generates text with GPT2, a decoder-only model. It wraps the model in `GenerationUtil` with `is_encoder_decoder=False`, tokenizes an input prompt, calls `generate` with `max_len=20`, and decodes the generated tokens back to text. The variable `gpt2` is assumed to hold a Hugging Face GPT2 model loaded earlier in the notebook.

```python
# Testing Huggingface's GPT2
test_sequence = ["I enjoy walking with my cute dog"]
generative_hf_gpt2 = GenerationUtil(gpt2, is_encoder_decoder=False, is_huggingface_model=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
test_sequence_tk = gpt2_tokenizer(test_sequence, return_tensors="pt").input_ids
tokens = generative_hf_gpt2.generate(test_sequence_tk, max_len=20, pad_idx=gpt2.config.pad_token_id)
print(gpt2_tokenizer.batch_decode(tokens, skip_special_tokens=True))
```
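
The source does not state which decoding strategy `GenerationUtil` applies here. As a conceptual aid only, the sketch below shows greedy decoding, one common strategy, on a toy lookup table. The table, the `<eos>` marker, and the function `greedy_generate` are all hypothetical and unrelated to the torchtext implementation.

```python
from typing import Dict, List

# Toy "model": next-token scores keyed by the previous token (hypothetical).
NEXT_TOKEN_SCORES: Dict[str, Dict[str, float]] = {
    "I": {"enjoy": 0.9, "am": 0.1},
    "enjoy": {"walking": 0.7, "tea": 0.3},
    "walking": {"<eos>": 0.6, "home": 0.4},
    "home": {"<eos>": 1.0},
}


def greedy_generate(prompt: List[str], max_len: int, eos: str = "<eos>") -> List[str]:
    """Append the highest-scoring next token until eos or max_len tokens."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            break  # the toy table has no continuation for this token
        best = max(scores, key=scores.get)
        if best == eos:
            break  # stop without emitting the end-of-sequence marker
        tokens.append(best)
    return tokens


print(greedy_generate(["I"], max_len=20))
```

In this sketch `max_len` bounds the total sequence length including the prompt; whether the real utility counts the prompt the same way is not confirmed by the source.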

--------------------------------

### Defining Python Extension Include Directories (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet specifies the include directories required for compiling the `_torchtext.so` Python extension. It mirrors many of the `libtorchtext` includes, ensuring access to common utilities and third-party dependencies, along with PyTorch headers for API integration.

```CMake
set(
    EXTENSION_INCLUDE_DIRS
    ${PROJECT_SOURCE_DIR}
    ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
    $<TARGET_PROPERTY:re2,INCLUDE_DIRECTORIES>
    $<TARGET_PROPERTY:double-conversion,INCLUDE_DIRECTORIES>
    $<TARGET_PROPERTY:utf8proc,INCLUDE_DIRECTORIES>
    ${TORCH_INSTALL_PREFIX}/include
    ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
    )
```

--------------------------------

### Comparing TorchText and Hugging Face T5 Model Outputs - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet tokenizes the input and output sentences using the shared `transform` function. It then runs both the TorchText and Hugging Face T5 models with the tokenized inputs and asserts that their respective encoder and decoder outputs are identical, confirming consistency between the implementations.

```Python
tokenized_sentence = transform(input_sentence)
tokenized_output = transform(output_sentence)

tt_output = tt_t5_model(encoder_tokens=tokenized_sentence, decoder_tokens=tokenized_output)
hf_output = hf_t5_model(input_ids=tokenized_sentence, decoder_input_ids=tokenized_output, return_dict=True)

assert torch.all(tt_output["encoder_output"].eq(hf_output["encoder_last_hidden_state"]))
assert torch.all(tt_output["decoder_output"].eq(hf_output["last_hidden_state"]))
```

--------------------------------

### Building _torchtext Python Extension (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet invokes the `define_extension` function to build the `_torchtext.so` Python extension. It uses the previously defined `EXTENSION_SOURCES`, `EXTENSION_INCLUDE_DIRS`, `EXTENSION_LINK_LIBRARIES`, and `LIBTORCHTEXT_COMPILE_DEFINITIONS` to configure the build, creating the Python-callable module.

```CMake
define_extension(
    _torchtext
    "${EXTENSION_SOURCES}"
    "${EXTENSION_INCLUDE_DIRS}"
    "${EXTENSION_LINK_LIBRARIES}"
    "${LIBTORCHTEXT_COMPILE_DEFINITIONS}"
    )
```

--------------------------------

### Defining T5 Prompt Constants and Helper Functions in Python

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet defines constants for common T5 tasks like summarization, translation, and question answering. It also provides helper functions to format input text according to the T5 model's expected prompt structure for these specific tasks.

```Python
SUMMARIZE_PROMPT = "summarize"
TRANSLATE_TO_GERMAN = "translate English to German"
QUESTION_PROMPT = "question"
CONTEXT_PROMPT = "context"


def summarize_text(text):
    return f"{SUMMARIZE_PROMPT}: {text}"


def en_to_german_text(text):
    return f"{TRANSLATE_TO_GERMAN}: {text}"


def qa_text(context, question):
    return f"{QUESTION_PROMPT}: {question}? {CONTEXT_PROMPT}: {context}"
```
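
As a quick illustration of the strings such helpers produce, the sketch below restates the same `prefix: text` layout as plain string templates. The names `make_prompt` and `make_qa_prompt` and the sample question are hypothetical and do not come from the notebook; only the prompt layout does.

```python
def make_prompt(prefix: str, text: str) -> str:
    """Prepend a T5-style task prefix to the input text."""
    return f"{prefix}: {text}"


def make_qa_prompt(question: str, context: str) -> str:
    """Mirror the question/context layout used for question answering."""
    return f"question: {question}? context: {context}"


print(make_prompt("summarize", "studies have shown that owning a dog is good for you"))
print(make_qa_prompt("Who wrote Hamlet", "Hamlet is a tragedy by William Shakespeare."))
```

Note that the question template appends `?` itself, so the question is passed without a trailing question mark.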

--------------------------------

### Defining Python Extension Link Libraries (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet lists the libraries that the `_torchtext.so` Python extension must link against. Crucially, it links `libtorchtext` itself, ensuring the extension can access the core C++ functionalities, along with other necessary dependencies implicitly handled by `libtorchtext`'s linkage.

```CMake
set(
    EXTENSION_LINK_LIBRARIES
    libtorchtext
  )
```

--------------------------------

### Performing Text Generation with T5 Model (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet demonstrates text generation using the T5 model. It initializes `GenerationUtil` with the T5 model, tokenizes a test sequence for summarization, and then generates output tokens. The generated tokens are finally decoded and printed, showcasing T5's ability to handle encoder-decoder tasks like summarization.

```python
# Testing Huggingface's T5
test_sequence = ["summarize: studies have shown that owning a dog is good for you"]
generative_hf_t5 = GenerationUtil(t5, is_encoder_decoder=True, is_huggingface_model=True)
t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")
test_sequence_tk = t5_tokenizer(test_sequence, return_tensors="pt").input_ids
tokens = generative_hf_t5.generate(test_sequence_tk, max_len=20, pad_idx=t5.config.pad_token_id)
print(t5_tokenizer.batch_decode(tokens, skip_special_tokens=True))
```

--------------------------------

### Defining a New TorchText Dataset Function

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING_DATASETS.md

This snippet illustrates the basic structure for defining a new dataset function within torchtext. It applies the `@_create_dataset_directory` decorator, which manages dataset caching, and the `@_wrap_split_argument` decorator, which handles dataset splits (e.g., `train`, `dev`, `test`). The function signature includes `root` for the cache directory and `split` for specifying data subsets, along with placeholders for any additional arguments.

```Python
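from typing import Tuple, Union

from torchtext.data.datasets_utils import _create_dataset_directory, _wrap_split_argument

DATASET_NAME = "MyDataName"


@_create_dataset_directory(dataset_name=DATASET_NAME)
@_wrap_split_argument(("train", "dev", "test"))
def MyDataName(root: str, split: Union[Tuple[str], str], …):
    …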
```
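
To make the role of the split decorator more concrete, here is a minimal sketch of how such a decorator could behave. It is a simplified stand-in written for illustration, not torchtext's actual `_wrap_split_argument`; the name `wrap_split_argument_sketch`, the `.data` default root, and the placeholder function body are all assumptions.

```python
from functools import wraps
from typing import Callable, Tuple, Union


def wrap_split_argument_sketch(valid_splits: Tuple[str, ...]) -> Callable:
    """Simplified stand-in: validate `split` and fan out over tuples of splits."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(root: str = ".data", split: Union[Tuple[str, ...], str] = valid_splits):
            requested = (split,) if isinstance(split, str) else tuple(split)
            for name in requested:
                if name not in valid_splits:
                    raise ValueError(f"Unknown split {name!r}; expected one of {valid_splits}")
            results = tuple(fn(root, name) for name in requested)
            # A single string yields a single dataset, a tuple yields a tuple.
            return results[0] if isinstance(split, str) else results

        return wrapper

    return decorator


@wrap_split_argument_sketch(("train", "dev", "test"))
def MyDataName(root: str, split: str):
    # Placeholder body: a real dataset function would build a datapipe here.
    return f"{root}/{split}"


print(MyDataName(split="train"))           # one split
print(MyDataName(split=("train", "dev")))  # several splits
```

With this shape, the decorated function only ever sees one split name at a time, which keeps the dataset-specific code free of argument handling.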

--------------------------------

### Adding Project Subdirectories

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Includes the `third_party` and `torchtext/csrc` directories as sub-projects, allowing CMake to process their respective `CMakeLists.txt` files and build their components.

```CMake
add_subdirectory(third_party)
add_subdirectory(torchtext/csrc)
```

--------------------------------

### Configuring macOS Specific Build Settings

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Applies macOS-specific configurations, including detecting the Clang version, enabling RPATH for shared libraries, and setting the shared library suffix to `.so`.

```CMake
if(APPLE)
  # Get clang version on macOS
  execute_process( COMMAND ${CMAKE_CXX_COMPILER} --version OUTPUT_VARIABLE clang_full_version_string )
  string(REGEX REPLACE "Apple LLVM version ([0-9]+\\.[0-9]+).*" "\\1" CLANG_VERSION_STRING ${clang_full_version_string})
  message( STATUS "CLANG_VERSION_STRING:         " ${CLANG_VERSION_STRING} )

  # RPATH stuff
  set(CMAKE_MACOSX_RPATH ON)

  set(CMAKE_SHARED_LIBRARY_SUFFIX ".so")
endif()
```
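
The version extraction can be tried outside CMake with a rough Python equivalent of the `string(REGEX REPLACE ...)` call. This is only a sketch: the two banner strings are illustrative samples rather than output captured from this build, and `re.DOTALL` is used so that `.*` also consumes the remaining lines of the banner.

```python
import re

# Same pattern as the CMake snippet; DOTALL lets ".*" span the rest of the banner.
PATTERN = re.compile(r"Apple LLVM version ([0-9]+\.[0-9]+).*", re.DOTALL)


def extract_clang_version(banner: str) -> str:
    """Return "major.minor" if the banner matches, else the banner unchanged."""
    return PATTERN.sub(r"\1", banner)


# Illustrative banners (assumed wording, not taken from the repository).
llvm_banner = "Apple LLVM version 10.0.0 (clang-1000.11.45.5)\nTarget: x86_64-apple-darwin18.2.0\n"
clang_banner = "Apple clang version 14.0.0 (clang-1400.0.29.202)\nTarget: arm64-apple-darwin22.1.0\n"

print(extract_clang_version(llvm_banner))                   # 10.0
print(extract_clang_version(clang_banner) == clang_banner)  # True: no match, left as-is
```

A regex replace leaves non-matching input untouched, so a banner with different wording passes through whole instead of being reduced to a version number.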

--------------------------------

### Performing Text Generation with BART Model (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This code illustrates text generation using the BART model, specifically for summarization of a news article. It sets up `GenerationUtil` with the BART model, tokenizes the input text, and generates a summary. The output demonstrates BART's capabilities as an encoder-decoder model for abstractive summarization.

```python
# Testing Huggingface's BART
test_sequence = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."]
generative_hf_bart = GenerationUtil(bart, is_encoder_decoder=True, is_huggingface_model=True)
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
test_sequence_tk = bart_tokenizer(test_sequence, return_tensors="pt").input_ids
tokens = generative_hf_bart.generate(test_sequence_tk, max_len=20, pad_idx=bart.config.pad_token_id)
print(bart_tokenizer.batch_decode(tokens, skip_special_tokens=True))
```

--------------------------------

### Defining Python Extension Source Files (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet defines the C++ source files specifically for the `_torchtext.so` Python extension. These files (`register_pybindings.cpp`, `vocab_factory.cpp`) contain the necessary C++ code that exposes `libtorchtext` functionalities to Python via Pybind11 or similar binding mechanisms.

```CMake
set(
    EXTENSION_SOURCES
    register_pybindings.cpp
    vocab_factory.cpp
    )
```

--------------------------------

### Building libtorchtext Shared Library (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet invokes the `define_library` function to build the `libtorchtext` shared library. It passes the previously defined variables for sources, include directories, link libraries, and compile definitions, centralizing the configuration for the main C++ library.

```CMake
define_library(
  libtorchtext
  "${LIBTORCHTEXT_SOURCES}"
  "${LIBTORCHTEXT_INCLUDE_DIRS}"
  "${LIBTORCHTEXT_LINK_LIBRARIES}"
  "${LIBTORCHTEXT_COMPILE_DEFINITIONS}"
  )
```

--------------------------------

### Inspecting Saved Files (Bash)

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This command, run from an IPython or Jupyter cell (hence the leading `!`), lists all files in the current directory, including hidden ones, in long format with human-readable sizes, sorted newest first. The output is piped to `head -3` so that only the first three lines are shown. It is a quick way to verify the presence and details of the recently saved model file.

```Bash
!ls -lath | head -3
```

--------------------------------

### Initializing Hugging Face Pre-trained Models (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This code loads three pre-trained Hugging Face models: T5 for conditional generation, BART for conditional generation (the `facebook/bart-large-cnn` checkpoint, fine-tuned for summarization on CNN/DailyMail), and GPT-2 with a language-modeling head. Each is loaded from its pre-trained checkpoint, ready for use in text generation tasks.

```python
from transformers import BartForConditionalGeneration, GPT2LMHeadModel, T5ForConditionalGeneration

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
```

--------------------------------

### Locating PyTorch Core Libraries

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Uses `find_library` to locate the PyTorch libraries that torchtext depends on (`c10`, `torch`, `torch_cpu`), searching the `${TORCH_INSTALL_PREFIX}/lib` directory.

```CMake
find_library(TORCH_C10_LIBRARY c10 PATHS "${TORCH_INSTALL_PREFIX}/lib")
find_library(TORCH_LIBRARY torch PATHS "${TORCH_INSTALL_PREFIX}/lib")
find_library(TORCH_CPU_LIBRARY torch_cpu PATHS "${TORCH_INSTALL_PREFIX}/lib")
```

--------------------------------

### Adding SentencePiece Library as CMake Subdirectory

Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt

This command includes the `sentencepiece` project as a subdirectory. SentencePiece is an unsupervised text tokenizer and detokenizer, commonly used in neural network-based text processing tasks. `EXCLUDE_FROM_ALL` keeps its targets out of the default `all` build, so they are built only when another target depends on them.

```CMake
add_subdirectory(sentencepiece EXCLUDE_FROM_ALL)
```

--------------------------------

### Running C++ Tokenizer Application (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md

This snippet executes the compiled C++ tokenizer application, passing the previously created TorchScript tokenizer file as an argument. It processes an input sentence and verifies that the output matches the expected result from the Python script.

```bash
./build/tokenizer/tokenize "tokenizer/${tokenizer_file}"
```

--------------------------------

### Configuring MSVC Runtime Library

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

For MSVC compilers, this snippet sets `CMAKE_MSVC_RUNTIME_LIBRARY` to `MultiThreaded`, with a generator expression that appends `Debug` for debug configurations. This selects the statically linked multithreaded C runtime (`/MT`, or `/MTd` in debug builds).

```CMake
if(MSVC)
  set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
endif()
```
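
The generator expression can be hard to read at first, so the following sketch spells out how it expands for a given build configuration. The function name `msvc_runtime_library` is hypothetical; the logic simply mirrors `MultiThreaded$<$<CONFIG:Debug>:Debug>`, assuming the configuration name is compared case-insensitively.

```python
def msvc_runtime_library(config: str) -> str:
    """Mimic "MultiThreaded$<$<CONFIG:Debug>:Debug>" for one configuration."""
    # $<CONFIG:Debug> is 1 only for the Debug configuration, and
    # $<1:Debug> expands to "Debug" while $<0:Debug> expands to nothing.
    return "MultiThreaded" + ("Debug" if config.lower() == "debug" else "")


for config in ("Debug", "Release", "RelWithDebInfo"):
    print(f"{config:>15} -> {msvc_runtime_library(config)}")
```

Only the `Debug` configuration gets the `Debug` suffix; every other configuration resolves to plain `MultiThreaded`.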

--------------------------------

### Adding utf8proc Library as CMake Subdirectory

Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt

This command includes the `utf8proc` project as a subdirectory. `utf8proc` is a small, clean C library for processing UTF-8 Unicode data, offering functions for normalization, case-folding, and character properties. `EXCLUDE_FROM_ALL` keeps its targets out of the default `all` build, so they are built only when another target depends on them.

```CMake
add_subdirectory(utf8proc EXCLUDE_FROM_ALL)
```

--------------------------------

### Importing Pre-trained T5 Generation Models from PyTorch Text

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet imports several pre-trained T5 generation model bundles (the small, large, 3B, and 11B variants) from the `torchtext.models` module. These bundles provide configurations and checkpoints for different scales of the T5 model.

```Python
from torchtext.models import T5_SMALL_GENERATION, T5_LARGE_GENERATION, T5_3B_GENERATION, T5_11B_GENERATION
```

--------------------------------

### Measuring T5 Model Inference Time on CPU using IPython

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This IPython magic command measures the execution time of the JIT-compiled T5-Large model (`t5_large`) when performing inference on the `EXAMPLE_INPUT` with a maximum output length of 100 tokens. This demonstrates CPU performance.

```Python
%time t5_large(EXAMPLE_INPUT, max_length=100)
```
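
Outside IPython, where the `%time` magic is unavailable, a similar wall-clock measurement can be sketched with the standard library. The helper `timed_call` is hypothetical, and `sum(range(...))` is only a stand-in workload for the model call.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be
# t5_large(EXAMPLE_INPUT, max_length=100).
result, seconds = timed_call(sum, range(1_000_000))
print(f"result={result}, wall time={seconds:.4f}s")
```

Like `%time`, this times a single call, so the first invocation of a model may include one-off warm-up costs.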

--------------------------------

### Adding Progress Bars with TQDM (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet specifies the 'tqdm' library, which is used to display smart progress bars for iterators in Python applications, providing visual feedback during long-running operations.

```Python
tqdm
```

--------------------------------

### Enabling Compile Commands and PIC

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Enables the generation of `compile_commands.json` for tooling and sets `CMAKE_POSITION_INDEPENDENT_CODE` to ON, which is crucial for shared libraries.

```CMake
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
```

--------------------------------

### Integrating Optional NLP Tools (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This section lists optional Natural Language Processing (NLP) tools including NLTK, spaCy, and sacremoses, along with a specific Git repository for 'revtok', providing advanced text processing capabilities.

```Python
nltk
spacy
sacremoses
git+https://github.com/jekbradbury/revtok.git
```
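
Because these tools are optional, code that uses them may want to check what is installed before importing. A minimal sketch with the standard library follows; the helper `available_packages` is hypothetical, and the import names are assumed to match the package names listed above (including `revtok` for the Git dependency).

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed import names for the optional dependencies.
OPTIONAL_PACKAGES = ("nltk", "spacy", "sacremoses", "revtok")


def available_packages(names: Iterable[str] = OPTIONAL_PACKAGES) -> Dict[str, bool]:
    """Map each top-level package name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


for name, present in available_packages().items():
    print(f"{name:<12} {'installed' if present else 'missing'}")
```

`find_spec` only looks the package up without importing it, so the check stays cheap even for heavy libraries such as spaCy.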

--------------------------------

### Downloading Files with Requests (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet includes the 'requests' library, a popular HTTP library for Python, used for making web requests to download data and other files from the internet.

```Python
requests
```

--------------------------------

### Configuring C/C++ Standard Versions

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Searches `CMAKE_CXX_FLAGS` for an explicit `-std=c++` setting and, if one is found, warns that PyTorch requires `-std=c++17` and that the setting should be removed. It then sets the C++ standard to 17 and the C standard to 11 for the project.

```CMake
string(FIND "${CMAKE_CXX_FLAGS}" "-std=c++" env_cxx_standard)
if(env_cxx_standard GREATER -1)
  message(
      WARNING "C++ standard version definition detected in environment variable. "
      "PyTorch requires -std=c++17. Please remove -std=c++ settings in your environment.")
endif()

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_C_STANDARD 11)
```
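
The detection step relies on `string(FIND ...)` returning `-1` when the substring is absent. The same check can be sketched in Python, whose `str.find` follows the same convention. The function name and the sample flag strings are hypothetical.

```python
def has_explicit_cxx_standard(cxx_flags: str) -> bool:
    """Return True if the flags contain any -std=c++ setting."""
    # str.find, like CMake's string(FIND), returns -1 when nothing is found.
    return cxx_flags.find("-std=c++") > -1


print(has_explicit_cxx_standard("-O2 -std=c++14"))  # True: would trigger the warning
print(has_explicit_cxx_standard("-O2 -Wall"))       # False: no warning
```

The check is a plain substring search, so it fires for any `-std=c++` value, including `-std=c++17` itself.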

--------------------------------

### Demonstrating TorchScriptability Issue with TorchText T5 and GenerationUtils (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet illustrates the problem of `GenerationUtils` breaking TorchScript compatibility for TorchText's T5 model. It shows that the tokenizer and the base T5 model are initially TorchScriptable, but wrapping the model with `GenerationUtils` prevents it from being JIT-scripted, leading to an exception. The failure is attributed to `**kwargs`, optional values, and multiple return types.

```Python
%load_ext autoreload
%autoreload 2

import torch
from torchtext.prototype.generate import GenerationUtils
from torchtext.models import T5_SMALL_GENERATION

# The tokenizer object is torchscriptable
tokenizer = T5_SMALL_GENERATION.transform()
tokenizer_jit = torch.jit.script(tokenizer)

# The T5 model is also torchscriptable
model = T5_SMALL_GENERATION.get_model()
model_jit = torch.jit.script(model)


# But after wrapping with GenerationUtils, the model is no longer torchscriptable
generative_model = GenerationUtils(model)
generative_model_jit = torch.jit.script(generative_model)
```

--------------------------------

### Measuring T5 Model Inference Time on GPU using IPython

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet first loads a JIT-compiled T5-Large model configured for CUDA (GPU) acceleration. It then uses an IPython magic command to measure the inference time on the GPU, allowing for a comparison of performance between CPU and GPU.

```Python
# Try to load to GPU and compare the time difference
t5_large_gpu = get_jit_from_bundle(T5_LARGE_GENERATION, cuda=True)
%time t5_large_gpu(EXAMPLE_INPUT, max_length=100)
```

--------------------------------

### Appending C++ Compiler Flags

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Appends to the existing C++ compiler flags: the C++ ABI definition that PyTorch was compiled with (for binary compatibility), `-Wall` to enable a broad set of common warnings, and the contents of `TORCH_CXX_FLAGS`.

```CMake
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_COMPILED_WITH_CXX_ABI} -Wall ${TORCH_CXX_FLAGS}")
```

--------------------------------

### Configuring DataLoader for Multi-processing with torchtext.datasets (Python)

Source: https://github.com/pytorch/text/blob/main/docs/source/datasets.rst

This snippet demonstrates how to properly configure `torch.utils.data.DataLoader` for multi-processing when working with `torchtext` datapipes. By using `worker_init_fn` from `torch.utils.data.backward_compatibility`, it ensures that data is not duplicated across workers. The `drop_last=True` parameter is also recommended to maintain consistent batch sizes.

```Python
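from torch.utils.data import DataLoader
from torch.utils.data.backward_compatibility import worker_init_fn
# `dp` is a datapipe returned by one of the torchtext.datasets functions
DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True)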
```
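
To see why per-worker sharding matters, the following pure-Python sketch contrasts unsharded and sharded iteration. It is a conceptual illustration only, not the implementation behind `worker_init_fn`; the `shard` helper and the round-robin scheme are assumptions made for the example.

```python
from typing import List, Sequence


def shard(items: Sequence[int], num_workers: int, worker_id: int) -> List[int]:
    """Round-robin sharding: worker k keeps items k, k + n, k + 2n, ..."""
    return list(items[worker_id::num_workers])


data = list(range(10))
num_workers = 4

# Without sharding, every worker iterates the full list (4x duplication).
unsharded_total = sum(len(data) for _ in range(num_workers))

# With sharding, each item is seen exactly once across all workers.
shards = [shard(data, num_workers, w) for w in range(num_workers)]
sharded_total = sum(len(s) for s in shards)

print(unsharded_total, sharded_total)  # 40 10
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The shards are not all the same length, which is one reason the last batch from each worker can be smaller than the rest and why `drop_last=True` helps keep batch sizes uniform.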