### Building Libtorchtext and Examples

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/README.md

This bash script provides the commands to build `libtorchtext` and its associated example applications. It first grants execute permission to `build.sh` and then runs the script to initiate the build process.

```bash
chmod +x build.sh # give script execute permission
./build.sh
```

--------------------------------

### Building torchtext from Source on Linux

Source: https://github.com/pytorch/text/blob/main/README.rst

This command compiles and installs torchtext from its source code on Linux systems. It cleans previous builds and performs a fresh installation.

```Shell
python setup.py clean install
```

--------------------------------

### Installing torchtext using Pip

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the torchtext library using the pip package installer. This is a standard Python package installation method.

```Shell
pip install torchtext
```

--------------------------------

### Installing SacreMoses Tokenizer

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the SacreMoses library, which provides a port of the Moses tokenizer. It is an optional dependency for using the Moses tokenizer with torchtext.

```Shell
pip install sacremoses
```

--------------------------------

### Running Basic Text Classification Training Script (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/text_classification/README.md

This snippet executes a shell script to run the basic text classification model training. It's typically used for a quick start or to run a predefined training pipeline, likely involving the AG_NEWS dataset as mentioned in the surrounding text.
```bash
./run_script.sh
```

--------------------------------

### Configuring libtorchtext C++ Example with CMake

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/CMakeLists.txt

This snippet defines the minimum CMake version, sets the project name, and configures build options for a C++ example using libtorchtext. It finds the required Torch package, applies its C++ flags, and includes libtorchtext and tokenizer subdirectories.

```CMake
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(libtorchtext_cpp_example)

SET(BUILD_TORCHTEXT_PYTHON_EXTENSION OFF CACHE BOOL "Build Python binding")

find_package(Torch REQUIRED)
message("libtorchtext CMakeLists: ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_subdirectory(../.. libtorchtext)
add_subdirectory(tokenizer)
```

--------------------------------

### Preparing Example Input and Loading a JIT-Compiled T5 Model

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet defines a list of example input strings formatted for T5 tasks (question answering and summarization). It then loads a JIT-compiled T5-Large generation model using the `get_jit_from_bundle` utility function, preparing it for inference.

```Python
EXAMPLE_INPUT = [
    'question: What does Nir likes to eat? context: Nir is a PM on the Care AI team. Nir only eats vegeterian food and he loves Pizza',
    'question: Who likes to eat pizza? context: Nir is a PM on the Care AI team. Nir only eats vegeterian food and he loves Pizza',
    "summarize: studies say that owning a dog is good for you",
]
t5_large = get_jit_from_bundle(T5_LARGE_GENERATION)
```

--------------------------------

### Building torchtext from Source on OSX

Source: https://github.com/pytorch/text/blob/main/README.rst

This command compiles and installs torchtext from its source code on OSX systems, explicitly using clang as the C++ compiler.
It cleans previous builds and performs a fresh installation.

```Shell
CC=clang CXX=clang++ python setup.py clean install
```

--------------------------------

### Installing SpaCy and English Model

Source: https://github.com/pytorch/text/blob/main/README.rst

These commands install the SpaCy library and download its small English language model, which is required if you intend to use SpaCy's English tokenizer with torchtext.

```Shell
pip install spacy
python -m spacy download en_core_web_sm
```

--------------------------------

### Cloning and Initializing torchtext Source Repository

Source: https://github.com/pytorch/text/blob/main/README.rst

These commands clone the torchtext repository from GitHub and initialize its submodules, which are necessary for building the library from source.

```Shell
git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive
```

--------------------------------

### Training Text Classification Model with SentencePiece and YelpReviewFull (Python)

Source: https://github.com/pytorch/text/blob/main/examples/text_classification/README.md

This shell command runs the Python training script for a text classification model. It specifies the 'YelpReviewFull' dataset, utilizes a CUDA-enabled device, enables SentencePiece tokenization, sets the number of training epochs to 10, and configures the embedding dimension to 64. This setup aims to reproduce fastText results.

```bash
python train.py YelpReviewFull --device cuda --use-sp-tokenizer True --num-epochs 10 --embed-dim 64
```

--------------------------------

### Installing SentencePiece for older torchtext versions

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the SentencePiece library using Conda. It is specifically required for torchtext versions 0.5 and below for subword tokenization.
```Shell
conda install -c powerai sentencepiece
```

--------------------------------

### Developing torchtext from Source

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs torchtext in 'develop' mode, which links the installed package to the source directory. This is useful for developers making modifications to the library without needing to reinstall after every change.

```Shell
python setup.py develop
```

--------------------------------

### Installing torchtext using Conda

Source: https://github.com/pytorch/text/blob/main/README.rst

This command installs the torchtext library using the Conda package manager from the PyTorch channel. It is a recommended method for managing Python packages and their dependencies.

```Shell
conda install -c pytorch torchtext
```

--------------------------------

### Conditional Python Extension Build Setup (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet conditionally sets up the build process for the `_torchtext.so` Python extension, activated by `BUILD_TORCHTEXT_PYTHON_EXTENSION`. It finds the `torch_python` library and, for Windows, ensures Python Development components are available, which are prerequisites for building Python extensions.

```CMake
if (BUILD_TORCHTEXT_PYTHON_EXTENSION)
  # See https://github.com/pytorch/pytorch/issues/38122
  find_library(TORCH_PYTHON_LIBRARY torch_python PATHS "${TORCH_INSTALL_PREFIX}/lib")

  if (WIN32)
    find_package(Python3 ${PYTHON_VERSION} EXACT COMPONENTS Development)
    set(ADDITIONAL_ITEMS Python3::Python)
  endif()
```

--------------------------------

### Setting CMake Module Path and Torch Prefixes

Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt

Extends the CMake module search path, defines `TORCH_INSTALL_PREFIX` for locating PyTorch installations, and sets `TORCH_COMPILED_WITH_CXX_ABI` to ensure ABI compatibility with PyTorch.
```CMake
set(CMAKE_MODULE_PATH "${CMAKE_MODULE_PATH};${CMAKE_CURRENT_SOURCE_DIR}/cmake")
set(TORCH_INSTALL_PREFIX "${CMAKE_PREFIX_PATH}/../.." CACHE STRING "Install path for torch")
set(TORCH_COMPILED_WITH_CXX_ABI "-D_GLIBCXX_USE_CXX11_ABI=0" CACHE STRING "Compile torchtext with cxx11_abi")
```

--------------------------------

### Installing pre-commit for Python Code Formatting (conda)

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command installs the `pre-commit` tool using conda from the `conda-forge` channel. It serves the same purpose as the pip installation, providing the necessary tool for `torchtext`'s code style enforcement and pre-commit hooks.

```shell
conda install -c conda-forge pre-commit
```

--------------------------------

### Installing TorchArrow with PyTorch Dependency (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/torcharrow/README.md

This command installs TorchArrow from source, ensuring that the `USE_TORCH=1` flag is set. This flag is crucial for enabling natively integrated text operators like `bpe_tokenize` and `lookup_indices` which depend on the PyTorch library, as TorchArrow does not include this dependency by default.

```Bash
USE_TORCH=1 python setup.py install
```

--------------------------------

### Defining libtorchtext Include Directories (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet specifies the necessary include directories for compiling the `libtorchtext` library. It includes paths to the project's source, third-party dependencies like SentencePiece, re2, double-conversion, utf8proc, and PyTorch installation directories for API headers.
```CMake
set(
  LIBTORCHTEXT_INCLUDE_DIRS
  ${PROJECT_SOURCE_DIR}
  ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
  $
  $
  $
  ${TORCH_INSTALL_PREFIX}/include
  ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
  )
```

--------------------------------

### Loading and Tokenizing IMDB Dataset with torchtext.datasets (Python)

Source: https://github.com/pytorch/text/blob/main/docs/source/datasets.rst

This example illustrates how to load the IMDB dataset using `torchtext.datasets.IMDB` and iterate through its elements. It shows a basic tokenization function applied to each line, demonstrating how to process the `(label, line)` pairs yielded by the dataset iterator. The `split='train'` argument specifies the dataset partition to load.

```Python
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
```

--------------------------------

### Installing pre-commit for Python Code Formatting (pip)

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command installs the `pre-commit` tool using pip, which is used to enforce code style for Python, text, and configuration files in `torchtext`. It's a prerequisite for automatically checking and fixing code format before committing changes.

```shell
pip install pre-commit
```

--------------------------------

### Defining Python Project Dependencies

Source: https://github.com/pytorch/text/blob/main/docs/requirements.txt

This snippet lists the Python packages required to build the documentation, with version constraints where they apply. Sphinx and sphinx_gallery are pinned to exact versions, Jinja2 is capped below 3.1.0, and matplotlib and regex are unpinned. The PyTorch Sphinx theme is installed directly from a Git repository.
```Python
Jinja2<3.1.0
sphinx==5.1.1
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@cece053#egg=pytorch_sphinx_theme
sphinx_gallery==0.11.1
matplotlib
regex
```

--------------------------------

### Defining Python Extension Build Function (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This CMake function `define_extension` defines how to build a Python shared extension module. It configures the library with specified sources, include directories (including Python's), link libraries (including `torch_python`), and compile definitions. It also handles platform-specific properties like `.pyd` suffix for MSVC and `LINK_FLAGS` for Apple, and sets installation rules.

```CMake
function(define_extension name sources include_dirs link_libraries definitions)
  add_library(${name} SHARED ${sources})
  target_compile_definitions(${name} PRIVATE "${definitions}")
  target_include_directories(
    ${name} PRIVATE ${Python_INCLUDE_DIR} ${include_dirs})
  target_link_libraries(
    ${name}
    ${link_libraries}
    ${TORCH_PYTHON_LIBRARY}
    ${ADDITIONAL_ITEMS}
    )
  set_target_properties(${name} PROPERTIES PREFIX "")
  if (MSVC)
    set_target_properties(${name} PROPERTIES SUFFIX ".pyd")
  endif(MSVC)
  if (APPLE)
    # https://github.com/facebookarchive/caffe2/issues/854#issuecomment-364538485
    # https://github.com/pytorch/pytorch/commit/73f6715f4725a0723d8171d3131e09ac7abf0666
    set_target_properties(${name} PROPERTIES LINK_FLAGS "-undefined dynamic_lookup")
  endif()
  install(
    TARGETS ${name}
    LIBRARY DESTINATION .
    RUNTIME DESTINATION .  # For Windows
    )
endfunction()
```

--------------------------------

### Defining Shared Library Build Function (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This CMake function `define_library` encapsulates the logic for building a shared library.
It takes the library name, source files, include directories, link libraries, and compile definitions as arguments, then configures the library properties, including setting a `.pyd` suffix for MSVC builds and defining installation rules.

```CMake
function (define_library name source include_dirs link_libraries compile_defs)
  add_library(${name} SHARED ${source})
  target_include_directories(${name} PRIVATE ${include_dirs})
  target_link_libraries(${name} ${link_libraries})
  target_compile_definitions(${name} PRIVATE ${compile_defs})
  set_target_properties(${name} PROPERTIES PREFIX "")
  if (MSVC)
    set_target_properties(${name} PROPERTIES SUFFIX ".pyd")
  endif(MSVC)
  install(
    TARGETS ${name}
    LIBRARY DESTINATION lib
    RUNTIME DESTINATION lib  # For Windows
    )
endfunction()
```

--------------------------------

### Initializing T5 Models and Data - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet initializes input and output sentences for translation, then prepares both TorchText and Hugging Face T5 models. It obtains the `transform` function and the TorchText T5 model (`tt_t5_model`) from `T5_BASE`, and loads the Hugging Face T5 base model (`hf_t5_model`) using `T5Model.from_pretrained`.

```Python
input_sentence = ["translate to Spanish: My name is Joe"]
output_sentence = ["Me llamo Joe"]

transform = T5_BASE.transform()
tt_t5_model = T5_BASE.get_model()

hf_t5_model = T5Model.from_pretrained("t5-base")
```

--------------------------------

### Utility Functions for Loading and JIT Compiling T5 Models in PyTorch

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This collection of functions provides a structured way to build, load, and prepare T5 models from bundles.
It includes `build_model` for flexible model instantiation, `load_model` for loading from a bundle, `get_model_from_bundle` for creating a `TorchScriptableT5` instance, and `get_jit_from_bundle` for obtaining a JIT-compiled version, optionally with CUDA support.

```Python
from typing import Optional, Union, Dict, Any
from torchtext import _TEXT_BUCKET
from urllib.parse import urljoin
from torchtext._download_hooks import load_state_dict_from_url


def build_model(
    config: T5Conf,
    T5Class=T5Model,
    freeze_model: bool = False,
    checkpoint: Optional[Union[str, Dict[str, torch.Tensor]]] = None,
    strict: bool = False,
    dl_kwargs: Optional[Dict[str, Any]] = None,
) -> T5Model:
    """Class builder method that can overide the default T5Model model class
    (reference: https://github.com/pytorch/text/blob/a1dc61b8e80df70fe7a35b9f5f5cc7e19c7dd8a3/torchtext/models/t5/bundler.py#L113)

    Args:
        config (T5Conf): An instance of classT5Conf that defined the model configuration
        freeze_model (bool): Indicates whether to freeze the model weights. (Default: `False`)
        checkpoint (str or Dict[str, torch.Tensor]): Path to or actual model state_dict. state_dict can have partial weights i.e only for encoder. (Default: ``None``)
        strict (bool): Passed to :func: `torch.nn.Module.load_state_dict` method. (Default: `False`)
        dl_kwargs (dictionary of keyword arguments): Passed to :func:`torch.hub.load_state_dict_from_url`.
            (Default: `None`)
    """
    model = T5Class(config, freeze_model)
    if checkpoint is not None:
        if torch.jit.isinstance(checkpoint, Dict[str, torch.Tensor]):
            state_dict = checkpoint
        elif isinstance(checkpoint, str):
            dl_kwargs = {} if dl_kwargs is None else dl_kwargs
            state_dict = load_state_dict_from_url(checkpoint, **dl_kwargs)
        else:
            raise TypeError(
                "checkpoint must be of type `str` or `Dict[str, torch.Tensor]` but got {}".format(type(checkpoint))
            )

        model.load_state_dict(state_dict, strict=strict)

    return model


def load_model(bundle, T5Class=T5TorchGenerative):
    """
    Example usage:
    >> model = load_model(bundle=T5_SMALL_GENERATION, T5Class=T5TorchGenerative)
    """
    return build_model(config=bundle.config, T5Class=T5Class, checkpoint=bundle._path)


def get_model_from_bundle(bundle, cuda=False):
    model = load_model(bundle=bundle, T5Class=T5TorchGenerative)
    tokenizer = bundle.transform()
    full_model = TorchScriptableT5(model=model, transform=tokenizer, cuda=cuda)
    return full_model


def get_jit_from_bundle(bundle, cuda=False):
    full_model = get_model_from_bundle(bundle, cuda=cuda)
    full_model_jit = torch.jit.script(full_model)
    return full_model_jit
```

--------------------------------

### Downloading GPT2 BPE Tokenizer Artifacts (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md

This snippet downloads the necessary `gpt2_bpe_vocab.bpe` and `gpt2_bpe_encoder.json` files, which are prerequisites for constructing the `GPT2BPETokenizer` object in subsequent steps.

```bash
curl -O https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe
curl -O https://download.pytorch.org/models/text/gpt2_bpe_encoder.json
```

--------------------------------

### Running RoBERTa SST-2 Training Script (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/torcharrow/README.md

This command executes the `roberta_sst2_training_with_torcharrow.py` script, initiating the end-to-end training process for SST-2 binary classification.
It configures the training with a batch size of 16, runs for 1 epoch, and sets the learning rate to 1e-5, demonstrating the usage of the TorchArrow-based pipeline.

```Bash
python roberta_sst2_training_with_torcharrow.py \
    --batch-size 16 \
    --num-epochs 1 \
    --learning-rate 1e-5
```

--------------------------------

### Dynamically Modifying Tutorial and GitHub Links (JavaScript)

Source: https://github.com/pytorch/text/blob/main/docs/source/_templates/layout.html

This JavaScript snippet, executed on document ready, dynamically updates the 'Run in Google Colab' and 'View on GitHub' links for tutorials, deriving the Colab URL from the page's notebook download link. It also overwrites the main 'GitHub' link in the navigation menu to point to the `pytorch/text` repository, ensuring correct resource access for users.

```JavaScript
var collapsedSections = [];

$(document).ready(function() {
  var downloadNote = $(".sphx-glr-download-link-note.admonition.note");
  if (downloadNote.length >= 1) {
    var tutorialUrl = $("#tutorial-type").text();
    var githubLink = "https://github.com/pytorch/text/blob/main/examples/" + tutorialUrl + ".py",
        notebookLink = $(".reference.download")[1].href,
        notebookDownloadPath = notebookLink.split('_downloads')[1],
        colabLink = "https://colab.research.google.com/github/pytorch/text/blob/gh-pages/main/_downloads" + notebookDownloadPath;

    $(".pytorch-call-to-action-links a[data-response='Run in Google Colab']").attr("href", colabLink);
    $(".pytorch-call-to-action-links a[data-response='View on Github']").attr("href", githubLink);
  }

  // Overwrite the link to GitHub project
  var overwrite = function(_) {
    if ($(this).length > 0) {
      $(this)[0].href = "https://github.com/pytorch/text"
    }
  }
  // PC
  $(".main-menu a:contains('GitHub')").each(overwrite);
  // Mobile
  $(".main-menu a:contains('Github')").each(overwrite);
});
```

--------------------------------

### Running Unit Tests with Pytest (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet lists testing frameworks and
libraries, including 'pytest' for running unit tests, 'expecttest' for snapshot testing, and 'parameterized' for creating parameterized test cases, all essential for ensuring code quality.

```Python
pytest
expecttest
parameterized
```

--------------------------------

### T5 Model Construction and Loading Utilities in PyTorch

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet provides a set of utility functions for building and loading T5 models in PyTorch. The `build_model` function allows for flexible model instantiation from a configuration and supports loading partial or full model checkpoints. `load_model` simplifies the process by loading a model directly from a bundle, while `get_model_from_bundle` and `get_jit_from_bundle` further encapsulate the creation of a full, scriptable T5 model with its associated tokenizer.

```python
from typing import Optional, Union, Dict, Any
from torchtext import _TEXT_BUCKET
from urllib.parse import urljoin
from torchtext._download_hooks import load_state_dict_from_url


def build_model(
    config: T5Conf,
    T5Class=T5Model,
    freeze_model: bool = False,
    checkpoint: Optional[Union[str, Dict[str, torch.Tensor]]] = None,
    strict: bool = False,
    dl_kwargs: Optional[Dict[str, Any]] = None,
) -> T5Model:
    """Class builder method that can overide the default T5Model model class
    (reference: https://github.com/pytorch/text/blob/a1dc61b8e80df70fe7a35b9f5f5cc7e19c7dd8a3/torchtext/models/t5/bundler.py#L113)

    Args:
        config (T5Conf): An instance of classT5Conf that defined the model configuration
        freeze_model (bool): Indicates whether to freeze the model weights. (Default: `False`)
        checkpoint (str or Dict[str, torch.Tensor]): Path to or actual model state_dict. state_dict can have partial weights i.e only for encoder. (Default: ``None``)
        strict (bool): Passed to :func: `torch.nn.Module.load_state_dict` method.
            (Default: `False`)
        dl_kwargs (dictionary of keyword arguments): Passed to :func:`torch.hub.load_state_dict_from_url`. (Default: `None`)
    """
    model = T5Class(config, freeze_model)
    if checkpoint is not None:
        if torch.jit.isinstance(checkpoint, Dict[str, torch.Tensor]):
            state_dict = checkpoint
        elif isinstance(checkpoint, str):
            dl_kwargs = {} if dl_kwargs is None else dl_kwargs
            state_dict = load_state_dict_from_url(checkpoint, **dl_kwargs)
        else:
            raise TypeError(
                "checkpoint must be of type `str` or `Dict[str, torch.Tensor]` but got {}".format(type(checkpoint))
            )

        model.load_state_dict(state_dict, strict=strict)

    return model


def load_model(bundle, T5Class=T5TorchGenerative):
    """
    Example usage:
    >> model = load_model(bundle=T5_SMALL_GENERATION, T5Class=T5TorchGenerative)
    """
    return build_model(config=bundle.config, T5Class=T5Class, checkpoint=bundle._path)


def get_model_from_bundle(bundle):
    model = load_model(bundle=bundle, T5Class=T5TorchGenerative)
    tokenizer = bundle.transform()
    full_model = TorchScriptableT5(model=model, transform=tokenizer)
    return full_model


def get_jit_from_bundle(bundle):
    full_model = get_model_from_bundle(bundle)
```

--------------------------------

### Importing Libraries for T5 Model Comparison - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet imports necessary libraries for comparing TorchText's T5 model with Hugging Face's T5 model. It includes `T5Model` from `transformers` for the Hugging Face implementation, `T5_BASE` from `torchtext.prototype.models` for the TorchText implementation, and `torch` for tensor operations and assertions.
```Python
from transformers import T5Model
from torchtext.prototype.models import T5_BASE

import torch
```

--------------------------------

### Importing Hugging Face and TorchText Generation Utilities (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet imports necessary classes from the `transformers` library for various models (T5, BART, GPT2) and their tokenizers, along with `GenerationUtil` from `torchtext.prototype.generate`, which is used for abstracting the generation process. It sets up the required dependencies for subsequent model operations.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer, BartForConditionalGeneration, BartTokenizer, GPT2LMHeadModel, GPT2Tokenizer
from torchtext.prototype.generate import GenerationUtil
```

--------------------------------

### JIT Compiling a PyTorch Model

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This fragment compiles a PyTorch model with `torch.jit.script`, which converts the module to TorchScript by compiling its Python source rather than tracing an example input. The scripted model can be serialized and run without a Python interpreter, which helps with deployment. The `return` statement indicates that the fragment comes from inside a function body.

```Python
full_model_jit = torch.jit.script(full_model)
return full_model_jit
```

--------------------------------

### Generating Documentation with Sphinx (Python)

Source: https://github.com/pytorch/text/blob/main/requirements.txt

This snippet specifies 'Sphinx', a widely used documentation generator that creates intelligent and beautiful documentation from reStructuredText or Markdown sources for Python projects.

```Python
Sphinx
```

--------------------------------

### Adding double-conversion Library as CMake Subdirectory

Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt

This command includes the `double-conversion` project as a subdirectory.
This library provides highly optimized and accurate conversions between floating-point numbers and their string representations. `EXCLUDE_FROM_ALL` prevents it from being built by default.

```CMake
add_subdirectory(double-conversion EXCLUDE_FROM_ALL)
```

--------------------------------

### Defining libtorchtext Source Files (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet defines the C++ source files required to build the `libtorchtext` library. These files contain the core implementations of tokenizers, common utilities, and bindings for PyTorch. It lists various `.cpp` files like `clip_tokenizer.cpp`, `gpt2_bpe_tokenizer.cpp`, and `vocab.cpp`.

```CMake
set(
  LIBTORCHTEXT_SOURCES
  clip_tokenizer.cpp
  common.cpp
  gpt2_bpe_tokenizer.cpp
  regex.cpp
  regex_tokenizer.cpp
  register_torchbindings.cpp
  sentencepiece.cpp
  vectors.cpp
  vocab.cpp
  bert_tokenizer.cpp
  )
```

--------------------------------

### Running clang-format for C++ Code Formatting

Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING.md

This command executes the `run-clang-format.py` script to recursively format C++ files within the `torchtext/csrc` directory. It requires the path to the `clang-format` executable to be provided via the `$CLANG_FORMAT` environment variable, ensuring strict C++ code style enforcement.

```shell
python run-clang-format.py \
    --recursive \
    --clang-format-executable=$CLANG_FORMAT \
    torchtext/csrc
```

--------------------------------

### Configuring Executable and Linking Libraries in CMake

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/CMakeLists.txt

This CMake configuration defines an executable named 'tokenize' from 'main.cpp', links it against the PyTorch and TorchText libraries, and sets the C++ standard to C++14 for compilation. These steps are essential for building C++ applications that interact with PyTorch and TorchText functionalities.
```CMake
add_executable(tokenize main.cpp)
target_link_libraries(tokenize "${TORCH_LIBRARIES}" "${TORCHTEXT_LIBRARY}")
set_property(TARGET tokenize PROPERTY CXX_STANDARD 14)
```

--------------------------------

### Creating TorchScript Tokenizer File (Bash)

Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md

This snippet executes a Python script to create a TorchScript object of the tokenizer, saving it to a specified file. It also verifies the tokenizer's output before and after saving/reloading, preparing it for use in a C++ application.

```bash
tokenizer_file="tokenizer.pt"
python create_tokenizer.py --tokenizer-file "${tokenizer_file}"
```

--------------------------------

### Performing Text Generation with GPT2 Model (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet showcases text generation using the GPT2 model, which is a decoder-only model. It configures `GenerationUtil` for GPT2, tokenizes an input prompt, and generates a continuation of the sequence. The output demonstrates GPT2's ability to complete sentences and generate coherent text based on a given prefix.

```python
# Testing Huggingface's GPT2
test_sequence = ["I enjoy walking with my cute dog"]
generative_hf_gpt2 = GenerationUtil(gpt2, is_encoder_decoder=False, is_huggingface_model=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
test_sequence_tk = gpt2_tokenizer(test_sequence, return_tensors="pt").input_ids
tokens = generative_hf_gpt2.generate(test_sequence_tk, max_len=20, pad_idx=gpt2.config.pad_token_id)
print(gpt2_tokenizer.batch_decode(tokens, skip_special_tokens=True))
```

--------------------------------

### Defining Python Extension Include Directories (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet specifies the include directories required for compiling the `_torchtext.so` Python extension.
It mirrors many of the `libtorchtext` includes, ensuring access to common utilities and third-party dependencies, along with PyTorch headers for API integration.

```CMake
set(
  EXTENSION_INCLUDE_DIRS
  ${PROJECT_SOURCE_DIR}
  ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
  $
  $
  $
  ${TORCH_INSTALL_PREFIX}/include
  ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
  )
```

--------------------------------

### Comparing TorchText and Hugging Face T5 Model Outputs - Python

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_vs_tt_t5.ipynb

This snippet tokenizes the input and output sentences using the shared `transform` function. It then runs both the TorchText and Hugging Face T5 models with the tokenized inputs and asserts that their respective encoder and decoder outputs are identical, confirming consistency between the implementations.

```Python
tokenized_sentence = transform(input_sentence)
tokenized_output = transform(output_sentence)

tt_output = tt_t5_model(encoder_tokens=tokenized_sentence, decoder_tokens=tokenized_output)
hf_output = hf_t5_model(input_ids=tokenized_sentence, decoder_input_ids=tokenized_output, return_dict=True)

assert torch.all(tt_output["encoder_output"].eq(hf_output["encoder_last_hidden_state"]))
assert torch.all(tt_output["decoder_output"].eq(hf_output["last_hidden_state"]))
```

--------------------------------

### Building _torchtext Python Extension (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet invokes the `define_extension` function to build the `_torchtext.so` Python extension. It uses the previously defined `EXTENSION_SOURCES`, `EXTENSION_INCLUDE_DIRS`, `EXTENSION_LINK_LIBRARIES`, and `LIBTORCHTEXT_COMPILE_DEFINITIONS` to configure the build, creating the Python-callable module.
```CMake
define_extension(
  _torchtext
  "${EXTENSION_SOURCES}"
  "${EXTENSION_INCLUDE_DIRS}"
  "${EXTENSION_LINK_LIBRARIES}"
  "${LIBTORCHTEXT_COMPILE_DEFINITIONS}"
  )
```

--------------------------------

### Defining T5 Prompt Constants and Helper Functions in Python

Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb

This snippet defines constants for common T5 tasks like summarization, translation, and question answering. It also provides helper functions to format input text according to the T5 model's expected prompt structure for these specific tasks.

```Python
SUMMERIZE_PROMP = "summarize"
TRANSLATE_TO_GERMAN = "translate English to German"
QUESTION_PROMPS = "question"
CONTEXT_PROMPT = "context"


def summarize_text(text):
    return f"{SUMMERIZE_PROMP}: {text}"


def en_to_german_text(text):
    return f"{TRANSLATE_TO_GERMAN}: {text}"


def qa_text(context, question):
    return f"{QUESTION_PROMPS}: {question}? {CONTEXT_PROMPT}: {context}"
```

--------------------------------

### Defining Python Extension Link Libraries (CMake)

Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt

This snippet lists the libraries that the `_torchtext.so` Python extension must link against. Crucially, it links `libtorchtext` itself, ensuring the extension can access the core C++ functionalities, along with other necessary dependencies implicitly handled by `libtorchtext`'s linkage.

```CMake
set(
  EXTENSION_LINK_LIBRARIES
  libtorchtext
  )
```

--------------------------------

### Performing Text Generation with T5 Model (Python)

Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb

This snippet demonstrates text generation using the T5 model. It initializes `GenerationUtil` with the T5 model, tokenizes a test sequence for summarization, and then generates output tokens. The generated tokens are finally decoded and printed, showcasing T5's ability to handle encoder-decoder tasks like summarization.
```python # Testing Huggingface's T5 test_sequence = ["summarize: studies have shown that owning a dog is good for you"] generative_hf_t5 = GenerationUtil(t5, is_encoder_decoder=True, is_huggingface_model=True) t5_tokenizer = T5Tokenizer.from_pretrained("t5-base") test_sequence_tk = t5_tokenizer(test_sequence, return_tensors="pt").input_ids tokens = generative_hf_t5.generate(test_sequence_tk, max_len=20, pad_idx=t5.config.pad_token_id) print(t5_tokenizer.batch_decode(tokens, skip_special_tokens=True)) ``` -------------------------------- ### Defining a New TorchText Dataset Function Source: https://github.com/pytorch/text/blob/main/CONTRIBUTING_DATASETS.md This snippet illustrates the foundational structure for defining a new dataset function within torchtext. It demonstrates the application of the @_create_dataset_directory decorator for managing dataset caching and the @_wrap_split_argument decorator for handling dataset splits (e.g., 'train', 'dev', 'test'). The function signature includes 'root' for the cache directory and 'split' for specifying data subsets, along with placeholders for additional necessary arguments. ```Python DATASET_NAME = "MyDataName" @_create_dataset_directory(dataset_name=DATASET_NAME) @_wrap_split_argument(("train", "dev","test")) def MyDataName(root: str, split: Union[Tuple[str], str], …): … ``` -------------------------------- ### Adding Project Subdirectories Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Includes the `third_party` and `torchtext/csrc` directories as sub-projects, allowing CMake to process their respective `CMakeLists.txt` files and build their components. 
```CMake add_subdirectory(third_party) add_subdirectory(torchtext/csrc) ``` -------------------------------- ### Configuring macOS Specific Build Settings Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Applies macOS-specific configurations, including detecting the Clang version, enabling RPATH for shared libraries, and setting the shared library suffix to `.so`. ```CMake if(APPLE) # Get clang version on macOS execute_process( COMMAND ${CMAKE_CXX_COMPILER} --version OUTPUT_VARIABLE clang_full_version_string ) string(REGEX REPLACE "Apple LLVM version ([0-9]+\\.[0-9]+).*" "\\1" CLANG_VERSION_STRING ${clang_full_version_string}) message( STATUS "CLANG_VERSION_STRING: " ${CLANG_VERSION_STRING} ) # RPATH stuff set(CMAKE_MACOSX_RPATH ON) set(CMAKE_SHARED_LIBRARY_SUFFIX ".so") endif() ``` -------------------------------- ### Performing Text Generation with BART Model (Python) Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb This code illustrates text generation using the BART model, specifically for summarization of a news article. It sets up `GenerationUtil` with the BART model, tokenizes the input text, and generates a summary. The output demonstrates BART's capabilities as an encoder-decoder model for abstractive summarization. ```python # Testing Huggingface's BART test_sequence = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds " "amid dry conditions. The aim is to reduce the risk of wildfires. 
Nearly 800 thousand customers were " "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."] generative_hf_bart = GenerationUtil(bart, is_encoder_decoder=True, is_huggingface_model=True) bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn") test_sequence_tk = bart_tokenizer(test_sequence, return_tensors="pt").input_ids tokens = generative_hf_bart.generate(test_sequence_tk, max_len=20, pad_idx=bart.config.pad_token_id) print(bart_tokenizer.batch_decode(tokens, skip_special_tokens=True)) ``` -------------------------------- ### Defining Python Extension Source Files (CMake) Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt This snippet defines the C++ source files specifically for the `_torchtext.so` Python extension. These files (`register_pybindings.cpp`, `vocab_factory.cpp`) contain the necessary C++ code that exposes `libtorchtext` functionalities to Python via Pybind11 or similar binding mechanisms. ```CMake set( EXTENSION_SOURCES register_pybindings.cpp vocab_factory.cpp ) ``` -------------------------------- ### Building libtorchtext Shared Library (CMake) Source: https://github.com/pytorch/text/blob/main/torchtext/csrc/CMakeLists.txt This snippet invokes the `define_library` function to build the `libtorchtext` shared library. It passes the previously defined variables for sources, include directories, link libraries, and compile definitions, centralizing the configuration for the main C++ library. 
```CMake define_library( libtorchtext "${LIBTORCHTEXT_SOURCES}" "${LIBTORCHTEXT_INCLUDE_DIRS}" "${LIBTORCHTEXT_LINK_LIBRARIES}" "${LIBTORCHTEXT_COMPILE_DEFINITIONS}" ) ``` -------------------------------- ### Inspecting Saved Files (Bash) Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb This command-line snippet, typically executed within an IPython or Jupyter environment, lists the files in the current directory in a long, human-readable, all-inclusive, and time-sorted format, then pipes the output to `head -3` to display only the first three lines. It's used to quickly verify the presence and details of the recently saved model file. ```Bash !ls -lath | head -3 ``` -------------------------------- ### Initializing Hugging Face Pre-trained Models (Python) Source: https://github.com/pytorch/text/blob/main/notebooks/hf_with_torchtext_gen.ipynb This code initializes three different pre-trained Hugging Face models: T5 for conditional generation, BART for conditional generation (specifically a CNN-optimized version), and GPT2 for language modeling. These models are loaded from their respective pre-trained checkpoints, ready for use in text generation tasks. ```python t5 = T5ForConditionalGeneration.from_pretrained("t5-base") bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn") gpt2 = GPT2LMHeadModel.from_pretrained("gpt2") ``` -------------------------------- ### Locating PyTorch Core Libraries Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Uses `find_library` to locate essential PyTorch libraries (c10, torch, torch_cpu) within the specified `TORCH_INSTALL_PREFIX/lib` directory, which are dependencies for torchtext. 
```CMake find_library(TORCH_C10_LIBRARY c10 PATHS "${TORCH_INSTALL_PREFIX}/lib") find_library(TORCH_LIBRARY torch PATHS "${TORCH_INSTALL_PREFIX}/lib") find_library(TORCH_CPU_LIBRARY torch_cpu PATHS "${TORCH_INSTALL_PREFIX}/lib") ``` -------------------------------- ### Adding SentencePiece Library as CMake Subdirectory Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt This command includes the `sentencepiece` project as a subdirectory. SentencePiece is an unsupervised text tokenizer and detokenizer, commonly used in neural network-based text processing tasks. `EXCLUDE_FROM_ALL` keeps its targets out of the default `all` build, so they are built only when another target depends on them. ```CMake add_subdirectory(sentencepiece EXCLUDE_FROM_ALL) ``` -------------------------------- ### Running C++ Tokenizer Application (Bash) Source: https://github.com/pytorch/text/blob/main/examples/libtorchtext/tokenizer/README.md This snippet executes the compiled C++ tokenizer application, passing the previously created TorchScript tokenizer file as an argument. It processes an input sentence and verifies that the output matches the expected result from the Python script. ```bash ./build/tokenizer/tokenize "tokenizer/${tokenizer_file}" ``` -------------------------------- ### Configuring MSVC Runtime Library Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt For MSVC compilers, this snippet sets `CMAKE_MSVC_RUNTIME_LIBRARY` to `MultiThreaded`, switching to `MultiThreadedDebug` for Debug configurations, so that the statically linked C runtime is selected consistently. ```CMake if(MSVC) set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>") endif() ``` -------------------------------- ### Adding utf8proc Library as CMake Subdirectory Source: https://github.com/pytorch/text/blob/main/third_party/CMakeLists.txt This command includes the `utf8proc` project as a subdirectory. `utf8proc` is a small, clean C library for processing UTF-8 Unicode data, offering functions for normalization, case-folding, and character properties.
`EXCLUDE_FROM_ALL` prevents it from being built by default. ```CMake add_subdirectory(utf8proc EXCLUDE_FROM_ALL) ``` -------------------------------- ### Importing Pre-trained T5 Generation Models from PyTorch Text Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb This snippet imports various pre-trained T5 generation model bundles (small, large, 3B, 11B parameters) from the `torchtext.models` module. These bundles provide configurations and checkpoints for different scales of the T5 model. ```Python from torchtext.models import T5_SMALL_GENERATION, T5_LARGE_GENERATION, T5_3B_GENERATION, T5_11B_GENERATION ``` -------------------------------- ### Measuring T5 Model Inference Time on CPU using IPython Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb This IPython magic command measures the execution time of the JIT-compiled T5-Large model (`t5_large`) when performing inference on the `EXAMPLE_INPUT` with a maximum output length of 100 tokens. This demonstrates CPU performance. ```Python %time t5_large(EXAMPLE_INPUT, max_length=100) ``` -------------------------------- ### Adding Progress Bars with TQDM (Python) Source: https://github.com/pytorch/text/blob/main/requirements.txt This snippet specifies the 'tqdm' library, which is used to display smart progress bars for iterators in Python applications, providing visual feedback during long-running operations. ```Python tqdm ``` -------------------------------- ### Enabling Compile Commands and PIC Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Enables the generation of `compile_commands.json` for tooling and sets `CMAKE_POSITION_INDEPENDENT_CODE` to ON, which is crucial for shared libraries. 
```CMake set(CMAKE_EXPORT_COMPILE_COMMANDS ON) set(CMAKE_POSITION_INDEPENDENT_CODE ON) ``` -------------------------------- ### Integrating Optional NLP Tools (Python) Source: https://github.com/pytorch/text/blob/main/requirements.txt This section lists optional Natural Language Processing (NLP) tools including NLTK, spaCy, and sacremoses, along with a specific Git repository for 'revtok', providing advanced text processing capabilities. ```Python nltk spacy sacremoses git+https://github.com/jekbradbury/revtok.git ``` -------------------------------- ### Downloading Files with Requests (Python) Source: https://github.com/pytorch/text/blob/main/requirements.txt This requirements entry lists the 'requests' library, a popular HTTP library for Python, used for making web requests to download data and other files from the internet. ```Python requests ``` -------------------------------- ### Configuring C/C++ Standard Versions Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Checks whether `CMAKE_CXX_FLAGS` already contains a `-std=c++` setting (for example, one inherited from the environment) and warns the user whenever one is found, because the project requires C++17. It then explicitly sets the C++ standard to 17 and the C standard to 11 for the project. ```CMake string(FIND "${CMAKE_CXX_FLAGS}" "-std=c++" env_cxx_standard) if(env_cxx_standard GREATER -1) message( WARNING "C++ standard version definition detected in environment variable." "PyTorch requires -std=c++17. Please remove -std=c++ settings in your environment.") endif() set(CMAKE_CXX_STANDARD 17) set(CMAKE_C_STANDARD 11) ``` -------------------------------- ### Demonstrating TorchScriptability Issue with TorchText T5 and GenerationUtils (Python) Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb This snippet illustrates the problem of GenerationUtils breaking TorchScript compatibility for TorchText's T5 model.
It shows that the tokenizer and the base T5 model are initially TorchScriptable, but wrapping the model with `GenerationUtils` prevents it from being JIT-scripted, so the final `torch.jit.script` call raises an exception. The failure is attributed to `**kwargs`, optional values, and multiple return types. ```Python %load_ext autoreload %autoreload 2 import torch from torchtext.prototype.generate import GenerationUtils from torchtext.models import T5_SMALL_GENERATION # The tokenizer object is torchscriptable tokenizer = T5_SMALL_GENERATION.transform() tokenizer_jit = torch.jit.script(tokenizer) # The T5 model is also torchscriptable model = T5_SMALL_GENERATION.get_model() model_jit = torch.jit.script(model) # But after wrapping with GenerationUtils, the model is no longer torchscriptable generative_model = GenerationUtils(model) generative_model_jit = torch.jit.script(generative_model) ``` -------------------------------- ### Measuring T5 Model Inference Time on GPU using IPython Source: https://github.com/pytorch/text/blob/main/notebooks/torchscriptable_t5_with_torchtext.ipynb This snippet first loads a JIT-compiled T5-Large model configured for CUDA (GPU) acceleration. It then uses an IPython magic command to measure the inference time on the GPU, allowing for a comparison of performance between CPU and GPU. ```Python # Try to load to GPU and compare the time difference t5_large_gpu = get_jit_from_bundle(T5_LARGE_GENERATION, cuda=True) %time t5_large_gpu(EXAMPLE_INPUT, max_length=100) ``` -------------------------------- ### Appending C++ Compiler Flags Source: https://github.com/pytorch/text/blob/main/CMakeLists.txt Appends additional C++ compiler flags, including the C++ ABI definition for compatibility, `-Wall` to enable the common set of compiler warnings, and any existing `TORCH_CXX_FLAGS`.
```CMake set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_COMPILED_WITH_CXX_ABI} -Wall ${TORCH_CXX_FLAGS}") ``` -------------------------------- ### Configuring DataLoader for Multi-processing with torchtext.datasets (Python) Source: https://github.com/pytorch/text/blob/main/docs/source/datasets.rst This snippet demonstrates how to configure `torch.utils.data.DataLoader` for multi-processing when working with `torchtext` datapipes. Passing the `worker_init_fn` from `torch.utils.data.backward_compatibility` ensures that data is not duplicated across workers. Setting `drop_last=True` is also recommended so that every batch has the same size. Here `dp` is a datapipe returned by one of the `torchtext.datasets` functions. ```Python from torch.utils.data import DataLoader from torch.utils.data.backward_compatibility import worker_init_fn DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True) ```
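The duplication that `worker_init_fn` guards against can be illustrated without PyTorch. The following is a minimal pure-Python sketch, not the actual `DataLoader` or datapipe implementation: the helpers `worker_view`, `collect`, and `batches` are hypothetical names, and round-robin sharding is an assumption chosen for illustration. It shows that workers holding identical copies of a stream emit every sample once per worker unless each worker keeps only its own shard, and that dropping the last incomplete batch keeps batch sizes equal.

```python
from itertools import islice
from typing import Iterator, List, Sequence


def worker_view(samples: Sequence[int], worker_id: int,
                num_workers: int, sharded: bool) -> Iterator[int]:
    """Samples seen by one worker (hypothetical helper).

    Without sharding, every worker iterates over its full copy of the data.
    With sharding, a worker keeps every num_workers-th sample, starting at
    its own worker_id (round-robin; an assumption for this sketch).
    """
    if not sharded:
        return iter(samples)
    return islice(samples, worker_id, None, num_workers)


def collect(samples: Sequence[int], num_workers: int, sharded: bool) -> List[int]:
    """Gather everything that all workers would emit, sorted for comparison."""
    out: List[int] = []
    for worker_id in range(num_workers):
        out.extend(worker_view(samples, worker_id, num_workers, sharded))
    return sorted(out)


def batches(samples: Sequence[int], batch_size: int,
            drop_last: bool) -> List[List[int]]:
    """Group samples into batches, optionally dropping an incomplete final one."""
    result = [list(samples[i:i + batch_size])
              for i in range(0, len(samples), batch_size)]
    if drop_last and result and len(result[-1]) < batch_size:
        result.pop()
    return result


samples = list(range(8))
duplicated = collect(samples, num_workers=4, sharded=False)
unique = collect(samples, num_workers=4, sharded=True)

print(len(duplicated))  # 32: each of the 8 samples appears once per worker
print(unique)           # [0, 1, 2, 3, 4, 5, 6, 7]
print([len(b) for b in batches(list(range(10)), 4, drop_last=False)])  # [4, 4, 2]
print([len(b) for b in batches(list(range(10)), 4, drop_last=True)])   # [4, 4]
```

In this toy model the unsharded run yields four copies of every sample, which is the behavior the documented `worker_init_fn` is meant to prevent.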