### Example tokenizer.json

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md

An example of a tokenizer.json file, illustrating the configuration for a BPE model with specific normalizer, pre-tokenizer, post-processor, and decoder settings.

```json
{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": { "hello": 0, "world": 1 },
    "merges": ["h e", "l l"],
    "unk_token": "<unk>"
  },
  "normalizer": { "type": "Lowercase" },
  "pre_tokenizer": { "type": "ByteLevel", "add_prefix_space": true },
  "post_processor": {
    "type": "BertProcessing",
    "cls": ["[CLS]", 101],
    "sep": ["[SEP]", 102]
  },
  "decoder": { "type": "ByteLevel" }
}
```

--------------------------------

### Install Tokenizers.js via npm

Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md

Install the library using npm for use in your project.

```bash
npm install @huggingface/tokenizers
```

--------------------------------

### Example tokenizer_config.json

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md

An example of a tokenizer_config.json file, showing configurations for a BERT model including special tokens, model type, and processing flags.

```json
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_type": "bert",
  "name_or_path": "bert-base-uncased",
  "vocab_size": 30522,
  "model_max_length": 512,
  "do_lower_case": true,
  "tokenize_chinese_chars": true,
  "strip_accents": false,
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]",
  "pad_token": "[PAD]",
  "mask_token": "[MASK]",
  "clean_up_tokenization_spaces": true
}
```

--------------------------------

### Example Usage of WordPiece Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to create and use a WordPiece tokenizer with a sample vocabulary and configuration.
```javascript
import { WordPiece } from "@huggingface/tokenizers";

const model = new WordPiece({
  vocab: { "hello": 0, "world": 1, "hel": 2, "##lo": 3, "<unk>": 100 },
  unk_token: "<unk>",
  continuing_subword_prefix: "##",
  max_input_chars_per_word: 100
});

const tokens = model(["hello", "world"]);
// ["hello", "world"] or ["hel", "##lo", "world"] depending on vocab
```

--------------------------------

### Example Usage of MetaspacePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates how to instantiate and use MetaspacePreTokenizer to tokenize a string, showing the effect of the replacement and prepend_scheme configurations.

```javascript
import { MetaspacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new MetaspacePreTokenizer({
  replacement: "▁",
  prepend_scheme: "always"
});

const tokens = pretokenizer("Hello world");
// ["▁Hello", "▁world"]
```

--------------------------------

### Configure SequencePostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of setting up a SequencePostProcessor that applies BertProcessing followed by ByteLevel processing. This is useful for complex tokenization pipelines.

```javascript
import { SequencePostProcessor } from "@huggingface/tokenizers";

const postprocessor = new SequencePostProcessor({
  processors: [
    { type: "BertProcessing", cls: ["[CLS]", 101], sep: ["[SEP]", 102] },
    { type: "ByteLevel", trim_offsets: true }
  ]
});

const result = postprocessor(["hello", "world"], null, true);
```

--------------------------------

### Post-Processor Import Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Example of how to import PostProcessor and specific post-processing implementations from the @huggingface/tokenizers package.
```javascript
import {
  PostProcessor,
  BertProcessingPostProcessor,
  RobertaProcessingPostProcessor
} from "@huggingface/tokenizers";
```

--------------------------------

### Basic Decoder Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Demonstrates the basic usage of a decoder by instantiating `ByteLevelDecoder` and calling it with a list of tokens to produce text.

```javascript
import { ByteLevelDecoder } from "@huggingface/tokenizers";

const decoder = new ByteLevelDecoder({});
const text = decoder(["Hello", "Ġworld"]);
// "Hello world"
```

--------------------------------

### Configure ByteLevelPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of creating a ByteLevelPostProcessor with specific options for adding a prefix space, trimming offsets, and enabling regex splitting.

```javascript
import { ByteLevelPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new ByteLevelPostProcessor({
  add_prefix_space: true,
  trim_offsets: true,
  use_regex: true
});

const result = postprocessor(["Hello", "Ġworld"], null, false);
```

--------------------------------

### Import Pre-Tokenizers

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Demonstrates how to import specific pre-tokenizer classes from the @huggingface/tokenizers package. Ensure the package is installed before use.

```javascript
import {
  PreTokenizer,
  ByteLevelPreTokenizer,
  WhitespacePreTokenizer,
  BertPreTokenizer
} from "@huggingface/tokenizers";
```

--------------------------------

### Tokenization Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/types.md

Demonstrates how to use the tokenizer to encode a single string or a pair of strings, with options for returning token type IDs.
```javascript
const encoding = tokenizer.encode("Hello world");
// {
//   ids: [9906, 4435],
//   tokens: ['Hello', 'Ġworld'],
//   attention_mask: [1, 1]
// }
```

```javascript
const encoding_pair = tokenizer.encode("Hello world", {
  text_pair: "How are you?",
  return_token_type_ids: true
});
// {
//   ids: [101, 9906, 4435, 102, 2129, 2024, 2017, 102],
//   tokens: ['[CLS]', 'Hello', 'Ġworld', '[SEP]', 'How', 'are', 'you', '[SEP]'],
//   attention_mask: [1, 1, 1, 1, 1, 1, 1, 1],
//   token_type_ids: [0, 0, 0, 0, 1, 1, 1, 1]
// }
```

--------------------------------

### Model Import Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Example of how to import Model and its implementations from the @huggingface/tokenizers package.

```javascript
import { Model, BPE, WordPiece, Unigram } from "@huggingface/tokenizers";
```

--------------------------------

### BertProcessingPostProcessor Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Demonstrates how to use BertProcessingPostProcessor for single and paired sequences. Ensure the correct import statement is used.

```javascript
import { BertProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new BertProcessingPostProcessor({
  cls: ["[CLS]", 101],
  sep: ["[SEP]", 102]
});

// Single sequence
const result1 = postprocessor(["hello", "world"], null, true);
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], token_type_ids: [0, 0, 0, 0] }

// Paired sequences
const result2 = postprocessor(["hello"], ["world"], true);
// { tokens: ["[CLS]", "hello", "[SEP]", "world", "[SEP]"], token_type_ids: [0, 0, 0, 1, 1] }
```

--------------------------------

### WhitespacePreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to use WhitespacePreTokenizer to split a string by whitespace. It handles spaces, tabs, and newlines.
```javascript
import { WhitespacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new WhitespacePreTokenizer({});
const tokens = pretokenizer("Hello world\ttab");
// ["Hello", "world", "tab"]
```

--------------------------------

### Configure and Use RobertaProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example demonstrating how to instantiate and use the RobertaProcessingPostProcessor with custom CLS and SEP tokens, and offset trimming settings. The output includes the processed tokens and their type IDs.

```javascript
import { RobertaProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new RobertaProcessingPostProcessor({
  cls: ["<s>", 0],
  sep: ["</s>", 2],
  trim_offsets: true,
  add_prefix_space: false
});

const result = postprocessor(["hello", "world"], null, true);
// { tokens: ["<s>", "hello", "world", "</s>"], token_type_ids: [0, 0, 0, 0] }
```

--------------------------------

### Import All Main Components (JavaScript)

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Imports all core components, normalizers, pre-tokenizers, models, post-processors, and decoders from the library. Ensure you have the library installed.

```javascript
import {
  // Core
  Tokenizer,
  AddedToken,
  // Types
  Encoding,
  // Normalizers
  BertNormalizer,
  LowercaseNormalizer,
  NFKCNormalizer,
  // Pre-Tokenizers
  ByteLevelPreTokenizer,
  WhitespacePreTokenizer,
  BertPreTokenizer,
  // Models
  BPE,
  WordPiece,
  Unigram,
  // Post-Processors
  BertProcessingPostProcessor,
  RobertaProcessingPostProcessor,
  // Decoders
  ByteLevelDecoder,
  WordPieceDecoder,
  BPEDecoder
} from "@huggingface/tokenizers";
```

--------------------------------

### Use FixedLengthPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of splitting a string into chunks of length 4 using FixedLengthPreTokenizer.
```javascript
import { FixedLengthPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new FixedLengthPreTokenizer({ length: 4 });
const tokens = pretokenizer("helloworld");
// ["hell", "owor", "ld"] (consecutive chunks of 4 characters; the last chunk may be shorter)
```

--------------------------------

### Create and Use SequenceDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of creating a SequenceDecoder with multiple nested decoders and applying it to tokens. Ensure the necessary decoders are imported.

```javascript
import { SequenceDecoder } from "@huggingface/tokenizers";

const decoder = new SequenceDecoder({
  decoders: [
    { type: "ByteLevel", trim_offsets: true },
    { type: "Fuse" },
    { type: "Strip", content: " ", start: 0, stop: 1 }
  ]
});

const text = decoder(tokens);
```

--------------------------------

### Example Usage of PunctuationPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to create and use PunctuationPreTokenizer to tokenize a string containing punctuation, illustrating the 'Isolated' behavior.

```javascript
import { PunctuationPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new PunctuationPreTokenizer({ behavior: "Isolated" });
const tokens = pretokenizer("Hello, world!");
// ["Hello", ",", "world", "!"]
```

--------------------------------

### Use CharDelimiterSplitPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of splitting a string by a comma delimiter using CharDelimiterSplitPreTokenizer.
```javascript
import { CharDelimiterSplitPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new CharDelimiterSplitPreTokenizer({ delimiter: "," });
const tokens = pretokenizer("apple,banana,cherry");
// ["apple", "banana", "cherry"]
```

--------------------------------

### ByteLevelPreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates how to use ByteLevelPreTokenizer to tokenize a string. It converts text to byte representations before tokenization, useful for GPT-2 style tokenization.

```javascript
import { ByteLevelPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new ByteLevelPreTokenizer({
  add_prefix_space: true,
  use_regex: true
});

const tokens = pretokenizer("Hello world");
// ['Hello', 'Ġworld'] (Ġ represents byte encoding of space)
```

--------------------------------

### Use ReplacePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of replacing a hyphen with a space using ReplacePreTokenizer.

```javascript
import { ReplacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new ReplacePreTokenizer({
  pattern: { String: "-" },
  content: " "
});

const tokens = pretokenizer("hello-world");
// ["hello", "world"]
```

--------------------------------

### PostProcessor Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Demonstrates using a `PostProcessor` component, extending `Callable`, to format token arrays for specific model inputs. It accepts tokens and optional arguments.
```typescript
const processor = new BertProcessingPostProcessor(config);
const result = processor(tokens, null, true);
// Returns { tokens, token_type_ids }
```

--------------------------------

### Example Usage of Unigram Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to create and use a Unigram tokenizer with a sample vocabulary and an end-of-sequence token. Automatically fuses unknown tokens.

```javascript
import { Unigram } from "@huggingface/tokenizers";

const model = new Unigram({
  vocab: [
    ["<unk>", -10],
    ["hello", 0],
    ["world", 0],
    ["he", -5],
    ["llo", -5]
  ],
  unk_id: 0
}, "");

const tokens = model(["hello", "world"]);
```

--------------------------------

### Incorrect Import from Submodule

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

This example shows the incorrect way to import a component by referencing an internal submodule. This method should be avoided.

```javascript
// ❌ Wrong - don't import from submodules
import BertNormalizer from "@huggingface/tokenizers/core/normalizer/BertNormalizer";
```

--------------------------------

### PreTokenizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Shows how a `PreTokenizer` component, extending `Callable`, is used to split input strings into an array of tokens. It accepts a string or an array of strings as input.

```typescript
const pretokenizer = new WhitespacePreTokenizer({});
const tokens = pretokenizer("hello world");
// Returns ["hello", "world"]
```

--------------------------------

### Model Encoding Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to encode pre-tokenized strings using a model. The output can vary based on the vocabulary and model type, potentially splitting tokens into subwords.
```javascript
const pretokens = ["hello", "world"];
const encoded = model(pretokens);
// encoded might be ["hello", "world"] or
// ["hel", "##lo", "world"] depending on vocab and model type
```

--------------------------------

### AddedToken Behavior Rules

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/AddedToken.md

Explains the behavior rules for AddedToken, focusing on whitespace stripping and normalization, and provides examples.

```APIDOC
## Behavior Rules

**Whitespace Stripping:**
- When `lstrip=true` and token appears after other content, preceding whitespace is trimmed
- When `rstrip=true` and token appears before other content, following whitespace is trimmed
- Useful for preventing spaces around special tokens like "[MASK]"

**Normalization:**
- By default, special tokens are not normalized (`normalized=false`)
- Regular added tokens are normalized by default (`normalized=true`)
- Override with explicit `normalized` value

**Example with stripping:**
```javascript
const mask_token = new AddedToken({
  content: "[MASK]",
  id: 103,
  special: true,
  lstrip: true,
  rstrip: true
});

// With this config, " [MASK] " becomes "[MASK]"
// The surrounding spaces are handled by the tokenizer
```
```

--------------------------------

### BPE Unknown Token Handling Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/errors.md

Demonstrates how the `fuse_unk` flag affects the merging of consecutive unknown tokens. Ensure the `BPE` class is imported.
```javascript
import { BPE } from "@huggingface/tokenizers";

const model = new BPE({
  vocab: { "hello": 0, "<unk>": 100 },
  merges: [],
  unk_token: "<unk>",
  fuse_unk: true,
  byte_fallback: false
});

// Unknown word behavior depends on fuse_unk flag
const result = model(["unknown_word"]);
```

--------------------------------

### TokenizerModel Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Illustrates the usage of a `TokenizerModel` (e.g., BPE) which extends `Callable`. It takes an array of strings and returns an array of encoded tokens.

```typescript
const model = new BPE(config);
const encoded = model(["hello", "world"]);
// Returns encoded tokens
```

--------------------------------

### Use FuseDecoder to join tokens

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of instantiating FuseDecoder and using it to concatenate tokens into a single string without spaces.

```javascript
import { FuseDecoder } from "@huggingface/tokenizers";

const decoder = new FuseDecoder({});
const text = decoder(["h", "e", "l", "l", "o"]);
// "hello"
```

--------------------------------

### Use ByteFallbackDecoder for byte-encoded content

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of instantiating ByteFallbackDecoder to process tokens that may include byte-encoded fallback content.

```javascript
import { ByteFallbackDecoder } from "@huggingface/tokenizers";

const decoder = new ByteFallbackDecoder({});
const text = decoder(tokens);
// Handles byte-encoded fallback content
```

--------------------------------

### Configure and Use TemplateProcessingPostProcessor with GPT-2 Template

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of setting up the TemplateProcessingPostProcessor using a GPT-2 style template.
This configuration defines how single and paired sequences are processed, including the use of a custom separator token.

```javascript
import { TemplateProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new TemplateProcessingPostProcessor({
  single: [
    { Sequence: { id: "A" } }
  ],
  pair: [
    { Sequence: { id: "A" } },
    { SpecialToken: { id: "sep", ids: [50256], tokens: ["<|endoftext|>"] } },
    { Sequence: { id: "B" } }
  ],
  special_tokens: {
    sep: { id: "sep", ids: [50256], tokens: ["<|endoftext|>"] }
  }
});

const result = postprocessor(["hello"], ["world"], true);
// { tokens: ["hello", "<|endoftext|>", "world"] }
```

--------------------------------

### Correct Named Export Import

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

This example demonstrates the correct way to import components using named exports directly from the main package. This is the recommended approach for proper tree-shaking and bundling.

```javascript
// ✓ Correct
import { BertNormalizer } from "@huggingface/tokenizers";
```

--------------------------------

### NFKDNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form KD (Compatibility Decomposition). Use this to convert compatibility characters into their canonical equivalents.

```javascript
import { NFKDNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFKDNormalizer({});
const result = normalizer("ℌello");
// "Hello" (compatibility characters converted)
```

--------------------------------

### Normalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Demonstrates using a `Normalizer` component, which extends `Callable`, to normalize input strings. The component is invoked directly like a function.
```typescript
const normalizer = new LowercaseNormalizer({});
const normalized = normalizer("Hello");
// Returns "hello"
```

--------------------------------

### NFKCNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form KC (Compatibility Composition). Use this to convert compatibility characters into their composed canonical equivalents.

```javascript
import { NFKCNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFKCNormalizer({});
const result = normalizer("ℌello");
// "Hello" (compatibility form)
```

--------------------------------

### Tokenizer Encoding with Post-Processing

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Demonstrates how a tokenizer's encode method utilizes post-processors to add special tokens and generate token type IDs. This is a typical integration example.

```javascript
const tokenizer = new Tokenizer(tokenizerJson, config);

// During encoding:
const encoded = tokenizer.encode("Hello world", {
  text_pair: "How are you?",
  add_special_tokens: true,
  return_token_type_ids: true
});

// The post-processor adds special tokens and token type IDs:
// encoded = {
//   ids: [101, 7592, 2088, 102, 2129, 2024, 2017, 102],
//   tokens: ["[CLS]", "Hello", "world", "[SEP]", "How", "are", "you", "[SEP]"],
//   attention_mask: [1, 1, 1, 1, 1, 1, 1, 1],
//   token_type_ids: [0, 0, 0, 0, 1, 1, 1, 1]
// }
```

--------------------------------

### Instantiate StripDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Use StripDecoder to remove a specified substring from the start and/or end of concatenated tokens. Configure with content, start, and stop positions.
```typescript
new StripDecoder(config: TokenizerConfigDecoderStrip)
```

--------------------------------

### WhitespaceSplitPreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates the usage of WhitespaceSplitPreTokenizer to split a string by spaces only. It does not split on tabs or newlines.

```javascript
import { WhitespaceSplitPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new WhitespaceSplitPreTokenizer({});
const tokens = pretokenizer("Hello world");
// ["Hello", "world"]
```

--------------------------------

### Initialize ByteLevelPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Instantiate a ByteLevelPostProcessor with custom configuration. Use this to control prefix space addition, offset trimming, and regex usage for byte-level tokenization.

```typescript
new ByteLevelPostProcessor(config: TokenizerConfigPostProcessorByteLevel)
```

--------------------------------

### Initialize and Use Tokenizer in JavaScript

Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md

Load tokenizer configuration files from the Hugging Face Hub and initialize a Tokenizer instance. This snippet demonstrates tokenizing, encoding, and decoding text.
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World");
// ['Hello', 'ĠWorld']

const encoded = tokenizer.encode("Hello World");
// { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }

const decoded = tokenizer.decode(encoded.ids);
// 'Hello World'
```

--------------------------------

### Initialize and Use BertPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to instantiate and use the BertPreTokenizer, which splits text based on whitespace and punctuation, isolating punctuation as separate tokens. Requires importing the class.

```typescript
import { BertPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new BertPreTokenizer({});
const tokens = pretokenizer("Hello, world!");
// ["Hello", ",", "world", "!"]
```

--------------------------------

### BPE Model Initialization and Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Initializes a Byte-Pair Encoding model with vocabulary and merge rules. Demonstrates encoding pre-tokenized strings and clearing the internal cache.
```javascript
import { BPE } from "@huggingface/tokenizers";

const model = new BPE({
  vocab: { "hello": 0, "world": 1, "he": 2, "llo": 3 },
  merges: ["h e", "l l"],
  unk_token: "<unk>",
  end_of_word_suffix: ""
});

const tokens = model(["hello", "world"]);
// Returns merged tokens, e.g. ["he", "llo", "world"]; the exact output depends on vocab and merges

model.clear_cache(); // Free memory
```

--------------------------------

### Initialize SequencePostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Create a SequencePostProcessor to chain multiple post-processing steps. Configure it with an array of processor definitions.

```typescript
new SequencePostProcessor(config: TokenizerConfigPostProcessorSequence)
```

--------------------------------

### Use StripDecoder to remove leading/trailing spaces

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of using StripDecoder to remove leading and trailing space characters from a list of tokens.

```javascript
import { StripDecoder } from "@huggingface/tokenizers";

const decoder = new StripDecoder({
  content: " ",
  start: 0,
  stop: 1
});

const text = decoder([" ", "hello", " ", "world"]);
// "hello world" (leading/trailing spaces stripped)
```

--------------------------------

### Initialize Tokenizer from Local Files

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Loads tokenizer configuration and vocabulary from local files within a specified directory. Requires Node.js `fs/promises` module.
```javascript
import { Tokenizer } from "@huggingface/tokenizers";
import fs from "fs/promises";

async function initTokenizerLocal(directory) {
  const tokenizerJson = JSON.parse(
    await fs.readFile(`${directory}/tokenizer.json`, "utf-8")
  );
  const tokenizerConfig = JSON.parse(
    await fs.readFile(`${directory}/tokenizer_config.json`, "utf-8")
  );
  return new Tokenizer(tokenizerJson, tokenizerConfig);
}
```

--------------------------------

### NFCNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form C (Canonical Composition). Use this to compose decomposed characters into their precomposed forms.

```javascript
import { NFCNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFCNormalizer({});
const result = normalizer("e\u0301"); // input: "e" + combining acute accent (2 code points)
// "\u00e9", i.e. "é" (base + diacritic composed into single character)
```

--------------------------------

### NFDNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form D (Canonical Decomposition). Use this to decompose characters into their base and combining diacritic forms.

```javascript
import { NFDNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFDNormalizer({});
const result = normalizer("\u00e9"); // input: precomposed "é" (1 code point)
// "e\u0301" (decomposed into base + combining diacritic)
```

--------------------------------

### ByteLevelDecoder Initialization and Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Initializes a `ByteLevelDecoder` with specific configuration options and demonstrates its use in converting byte-level encoded tokens back to UTF-8 text.
```javascript
import { ByteLevelDecoder } from "@huggingface/tokenizers";

const decoder = new ByteLevelDecoder({
  add_prefix_space: false,
  trim_offsets: false,
  use_regex: true
});

// Assuming tokens contain byte-level encoded text
const text = decoder(["Hello", "Ġworld"]);
// "Hello world"
```

--------------------------------

### Get Vocabulary Map

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md

Retrieves the complete vocabulary as a Map, mapping token strings to their IDs. Can optionally include user-added tokens.

```typescript
get_vocab(with_added_tokens?: boolean): Map<string, number>
```

```javascript
const vocab = tokenizer.get_vocab();
const vocab_size = vocab.size;

const vocab_no_added = tokenizer.get_vocab(false);
```

--------------------------------

### Initialize RobertaProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Use this post-processor for RoBERTa-style tokenization, which includes specific configurations for CLS and SEP tokens. It handles trimming offsets and adding prefix spaces based on the provided configuration.

```typescript
new RobertaProcessingPostProcessor(config: TokenizerConfigPostProcessorRoberta)
```

--------------------------------

### Load Tokenizer from Hugging Face Hub

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/README.md

Load tokenizer configuration files (tokenizer.json and tokenizer_config.json) from the Hugging Face Hub to initialize a Tokenizer instance.
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`
).then(res => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`
).then(res => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```

--------------------------------

### BertNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Instantiates and uses BertNormalizer with various configuration options for text cleaning, CJK character spacing, lowercasing, and accent stripping.

```javascript
import { BertNormalizer } from "@huggingface/tokenizers";

const normalizer = new BertNormalizer({
  clean_text: true,
  handle_chinese_chars: true,
  lowercase: true,
  strip_accents: true
});

const result = normalizer("Hello Wörld");
// "hello world" (accents removed, lowercased)
```

--------------------------------

### StripDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Strips a specified substring from the start and/or end of concatenated tokens. It allows for precise control over which parts of the tokenized string are removed.

```APIDOC
## StripDecoder

### Description
Strips a specified substring from the start and/or end of concatenated tokens. It allows for precise control over which parts of the tokenized string are removed.
### Constructor
```typescript
new StripDecoder(config: TokenizerConfigDecoderStrip)
```

### Configuration
| Config | Type | Description |
|--------|------|-------------|
| content | string | Content to search for and remove |
| start | number | Start position for stripping |
| stop | number | End position for stripping |

### Example
```javascript
import { StripDecoder } from "@huggingface/tokenizers";

const decoder = new StripDecoder({
  content: " ",
  start: 0,
  stop: 1
});

const text = decoder([" ", "hello", " ", "world"]);
// "hello world" (leading/trailing spaces stripped)
```
```

--------------------------------

### ByteLevelPreTokenizer Initialization

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Initializes the ByteLevelPreTokenizer with configuration options. Use this for BPE models requiring byte-level processing.

```typescript
new ByteLevelPreTokenizer(config: TokenizerConfigPreTokenizerByteLevel)
```

--------------------------------

### Initialize Tokenizer from Hugging Face Hub

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Loads tokenizer configuration and vocabulary from the Hugging Face Hub using a model ID. Requires an asynchronous context.
```javascript import { Tokenizer } from "@huggingface/tokenizers"; async function initTokenizer(modelId) { // Fetch tokenizer files from Hugging Face Hub const tokenizerJson = await fetch( `https://huggingface.co/${modelId}/resolve/main/tokenizer.json` ).then(res => res.json()); const tokenizerConfig = await fetch( `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json` ).then(res => res.json()); // Create tokenizer return new Tokenizer(tokenizerJson, tokenizerConfig); } // Usage const tokenizer = await initTokenizer("HuggingFaceTB/SmolLM3-3B"); ``` -------------------------------- ### Decoder Example Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md Shows how a `Decoder` component, extending `Callable`, reconstructs text from an array of tokens. It takes the token array and returns the decoded string. ```typescript const decoder = new ByteLevelDecoder({}); const text = decoder(tokens); // Returns reconstructed text ``` -------------------------------- ### RobertaProcessingPostProcessor Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md Initializes a RoBERTa-style post-processor. It's similar to BERT's post-processor but with a different special token configuration. ```APIDOC ## RobertaProcessingPostProcessor ### Description Initializes a RoBERTa-style post-processor, similar to BERT but with a different special token configuration. ### Constructor `new RobertaProcessingPostProcessor(config: TokenizerConfigPostProcessorRoberta)` ### Parameters #### Config - **cls** ([string, number]) - CLS token and its ID, given as a `[token, id]` pair. - **sep** ([string, number]) - SEP token and its ID, given as a `[token, id]` pair. - **trim_offsets** (boolean) - Trim offsets to remove whitespace (default: true). - **add_prefix_space** (boolean) - Add space prefix to first token (default: false).
### Example ```javascript import { RobertaProcessingPostProcessor } from "@huggingface/tokenizers"; const postprocessor = new RobertaProcessingPostProcessor({ cls: ["<s>", 0], sep: ["</s>", 2], trim_offsets: true, add_prefix_space: false }); const result = postprocessor(["hello", "world"], null, true); // { tokens: ["<s>", "hello", "world", "</s>"], token_type_ids: [0, 0, 0, 0] } ``` ``` -------------------------------- ### Initialize FixedLengthPreTokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md Instantiate a pre-tokenizer that splits text into fixed-length chunks. Requires a `length` configuration value. ```typescript new FixedLengthPreTokenizer(config: TokenizerConfigPreTokenizerFixedLength) ``` -------------------------------- ### Get Added Tokens Decoder Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md Returns a Map from user-added token IDs to their AddedToken objects, which expose properties such as `content`. ```typescript get_added_tokens_decoder(): Map<number, AddedToken> ``` ```javascript const added_tokens = tokenizer.get_added_tokens_decoder(); for (const [id, token] of added_tokens) { console.log(`ID ${id}: ${token.content}`); } ``` -------------------------------- ### Initialize TemplateProcessingPostProcessor Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md This post-processor offers maximum flexibility using a template syntax for arranging tokens. It supports configurations for single and paired sequences, along with custom special tokens. ```typescript new TemplateProcessingPostProcessor(config: TokenizerConfigPostProcessorTemplateProcessing) ``` -------------------------------- ### Normalizer Base Class Usage Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md Demonstrates how to instantiate a normalizer (e.g., BertNormalizer) and call it to normalize text.
```javascript const normalizer = new BertNormalizer(config); const normalized = normalizer("Hello WORLD"); ``` -------------------------------- ### Load Tokenizers.js via CDN Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md Include the library in your HTML using a CDN for browser-based applications. ```html ``` -------------------------------- ### Get Complete Vocabulary Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md Retrieve the entire vocabulary of the tokenizer, including any added special tokens. The size of the vocabulary can be logged, and the vocabulary can be converted to an array for iteration. ```javascript // Get all tokens with their IDs const vocab = tokenizer.get_vocab(); console.log(vocab.size); // Total vocabulary size // Convert to array for iteration const vocabArray = Array.from(vocab).slice(0, 10); console.log(vocabArray); // [["hello", 0], ["world", 1], ...] ``` -------------------------------- ### Tokenizer Constructor Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md Creates a new Tokenizer instance that orchestrates the tokenization pipeline. It takes tokenizer configuration and additional settings as input. ```APIDOC ## Constructor Tokenizer ### Description Creates a new Tokenizer instance that orchestrates the tokenization pipeline. It takes tokenizer configuration and additional settings as input. ### Method ```typescript new Tokenizer(tokenizer: Object, config: Object) ``` ### Parameters #### Path Parameters - **tokenizer** (Object) - Required - Tokenizer configuration from `tokenizer.json`. Must include properties: `model`, `decoder`, `post_processor`, `pre_tokenizer`, `normalizer` - **config** (Object) - Required - Additional configuration from `tokenizer_config.json`. 
Can include special tokens, processing flags, and custom settings ### Throws - Error if tokenizer object is missing required properties - Error if config object is invalid or missing ### Example ```javascript import { Tokenizer } from "@huggingface/tokenizers"; const modelId = "HuggingFaceTB/SmolLM3-3B"; const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`) .then((res) => res.json()); const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`) .then((res) => res.json()); const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig); ``` ``` -------------------------------- ### Get Vocabulary Without Added Tokens Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md Retrieve only the original model vocabulary, excluding any user-added special tokens. This is useful when you need to work with the base vocabulary of the pre-trained model. ```javascript // Original model vocabulary only const modelVocab = tokenizer.get_vocab(false); // Excludes user-added special tokens ``` -------------------------------- ### SequencePreTokenizer Constructor Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md Initializes a new SequencePreTokenizer by chaining multiple pre-tokenizers. The pre-tokenizers are applied in the order they are provided in the configuration array. ```APIDOC ## SequencePreTokenizer ### Description Chains multiple pre-tokenizers together. ### Constructor `new SequencePreTokenizer(config: TokenizerConfigPreTokenizerSequence)` ### Parameters #### `config` (TokenizerConfigPreTokenizerSequence) - **pretokenizers** (TokenizerConfigPreTokenizer[]) - Required - Array of pre-tokenizer configs to apply in order. 
### Example ```javascript import { SequencePreTokenizer } from "@huggingface/tokenizers"; const pretokenizer = new SequencePreTokenizer({ pretokenizers: [ { type: "ByteLevel", add_prefix_space: true }, { type: "Punctuation", behavior: "Isolated" } ] }); const tokens = pretokenizer("Hello, world!"); ``` ``` -------------------------------- ### Initialize CTCDecoder Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md Create a CTCDecoder for speech or audio models. Configure padding tokens, word delimiters, and cleanup behavior. ```typescript new CTCDecoder(config: TokenizerConfigDecoderCTC) ``` -------------------------------- ### Initialize ReplacePreTokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md Instantiate a pre-tokenizer that replaces patterns with specified content before splitting. Requires 'pattern' and 'content' configuration. ```typescript new ReplacePreTokenizer(config: TokenizerConfigPreTokenizerReplace) ``` -------------------------------- ### Access and Inspect Tokenizer Components Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md Provides access to individual components of a tokenizer, such as the normalizer, pre-tokenizer, model, post-processor, and decoder. Includes examples of testing these components and checking model vocabulary statistics. 
```javascript // Access individual components const normalizer = tokenizer.normalizer; const preTokenizer = tokenizer.pre_tokenizer; const model = tokenizer.model; const postProcessor = tokenizer.post_processor; const decoder = tokenizer.decoder; // Test individual components if (normalizer) { const normalized = normalizer("Hello WORLD"); console.log(normalized); } if (preTokenizer) { const pretokens = preTokenizer("hello world"); console.log(pretokens); } if (model) { const encoded = model(["hello", "world"]); console.log(encoded); } // Check model vocabulary stats if (model) { console.log(`Vocabulary size: ${model.vocab.length}`); console.log(`Unknown token: ${model.unk_token}`); console.log(`Unknown token ID: ${model.unk_token_id}`); } ``` -------------------------------- ### BertProcessingPostProcessor Usage Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md Illustrates the basic usage of BertProcessingPostProcessor without explicit imports, suitable for environments where it's already available. ```javascript const postprocessor = new BertProcessingPostProcessor({ sep: ["[SEP]", 102], cls: ["[CLS]", 101] }); const result = postprocessor(["hello", "world"], null, true); // { tokens: ["[CLS]", "hello", "world", "[SEP]"], token_type_ids: [0, 0, 0, 0] } ``` -------------------------------- ### Instantiate SequencePreTokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md Instantiates a SequencePreTokenizer with an array of pre-tokenizer configurations. This is useful for creating custom tokenization pipelines by combining different pre-tokenization strategies. 
```typescript new SequencePreTokenizer(config: TokenizerConfigPreTokenizerSequence) ``` ```javascript import { SequencePreTokenizer } from "@huggingface/tokenizers"; const pretokenizer = new SequencePreTokenizer({ pretokenizers: [ { type: "ByteLevel", add_prefix_space: true }, { type: "Punctuation", behavior: "Isolated" } ] }); const tokens = pretokenizer("Hello, world!"); ``` -------------------------------- ### Call PreTokenizer as a Function Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md Demonstrates calling a pre-tokenizer instance directly as a function to tokenize a string. This is the primary way to use pre-tokenizers after instantiation. ```javascript const pretokenizer = new WhitespacePreTokenizer({}); const tokens = pretokenizer("Hello world"); // ["Hello", "world"] ``` -------------------------------- ### Tokenization Pipeline Overview Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/INDEX.md Illustrates the complete flow of text processing within the tokenizers.js library, from input text to final output and reconstruction. ```text Input Text ↓ [Normalizer] ──→ Normalized text ↓ [PreTokenizer] ──→ Pre-tokens ↓ [Model] ──→ Sub-tokens ↓ [PostProcessor] ──→ Final tokens + token type IDs ↓ Output (IDs, tokens, attention mask) ↓ [Decoder] ──→ Reconstructed text ``` -------------------------------- ### Initialize WordPieceDecoder Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md Instantiate a WordPieceDecoder with custom prefix and cleanup options. The prefix is used to identify continuation tokens, and cleanup removes extra spaces. 
```typescript new WordPieceDecoder(config: TokenizerConfigDecoderWordPiece) ``` -------------------------------- ### Tokenizer Decoding with Integrated Decoders Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md Demonstrates how the Tokenizer automatically uses its configured decoders during the decoding process. Special tokens can be skipped and tokenization spaces can be cleaned up using options. ```javascript const tokenizer = new Tokenizer(tokenizerJson, config); const tokens = tokenizer.encode("Hello world"); const decoded = tokenizer.decode(tokens.ids); // Uses the decoder from tokenizer.json automatically // "Hello world" // With options: const decoded_clean = tokenizer.decode(tokens.ids, { skip_special_tokens: true, clean_up_tokenization_spaces: true }); ``` -------------------------------- ### Initialize WordPiece Tokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md Instantiate a WordPiece tokenizer with vocabulary and configuration. Words longer than max_input_chars_per_word are marked as unknown. Uses continuing_subword_prefix for non-initial subword pieces. ```typescript new WordPiece(config: TokenizerConfigWordPieceModel) ``` -------------------------------- ### Instantiate Tokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md Create a new Tokenizer instance by providing tokenizer and configuration JSON objects. This is useful when loading a tokenizer from a model ID. 
```typescript import { Tokenizer } from "@huggingface/tokenizers"; const modelId = "HuggingFaceTB/SmolLM3-3B"; const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`) .then((res) => res.json()); const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`) .then((res) => res.json()); const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig); ``` -------------------------------- ### Handle Unknown Component Type During Instantiation Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/errors.md When creating a tokenizer, ensure all component types (model, normalizer, pre-tokenizer, post-processor, decoder) specified in `tokenizerJson` are supported. Use valid types to prevent instantiation errors. ```javascript import { Tokenizer } from "@huggingface/tokenizers"; const tokenizerJson = { model: { type: "BPE", vocab: {}, merges: [] }, normalizer: { type: "InvalidNormalizer" } // Error! }; try { const tokenizer = new Tokenizer(tokenizerJson, {}); } catch (error) { console.error("Failed to create normalizer:", error); } ``` -------------------------------- ### Initialize Unigram Tokenizer Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md Instantiate a Unigram tokenizer with a model configuration and an end-of-sequence token. It uses the Viterbi algorithm to find the most probable tokenization. ```typescript new Unigram(config: TokenizerConfigUnigramModel, eos_token: string) ``` -------------------------------- ### SentencePiece Unigram Tokenizer Configuration Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md Set up a SentencePiece Unigram tokenizer using a precompiled character map, a Metaspace pre-tokenizer, and Unigram model specifics like vocab and unknown ID. ```json { "normalizer": { "type": "Sequence", "normalizers": [ { "type": "Precompiled", "precompiled_charsmap": "..."
} ] }, "pre_tokenizer": { "type": "Metaspace", "replacement": "▁", "prepend_scheme": "first" }, "model": { "type": "Unigram", "vocab": [["..."]], "unk_id": 0 }, "decoder": { "type": "Metaspace", "replacement": "▁" } } ``` -------------------------------- ### Package Exports Configuration Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md Defines the main entry points for the package, specifying how different module systems (CommonJS, ES Modules) and environments (Node.js, browser) should import the library. This configuration is crucial for package managers and bundlers. ```json { "exports": { ".": { "types": "./types/index.d.ts", "node": { "require": "./dist/tokenizers.cjs", "import": "./dist/tokenizers.mjs" }, "browser": { "import": "./dist/tokenizers.mjs" }, "default": "./dist/tokenizers.mjs" } } } ```
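The conditional `exports` map above can be read as a small decision tree: keys are tried in the order they are written, and the first key that matches an active condition (or `default`) wins. The sketch below is a simplified model of that walk, not Node's actual resolver. `resolveExport`, `resolveTarget`, and the condition lists passed to them are hypothetical names chosen for illustration, and real resolvers also handle subpath patterns, arrays, and validation that are omitted here.

```javascript
// Simplified model of conditional-exports resolution (illustration only).
// `resolveTarget` and `resolveExport` are hypothetical helpers, not part of
// Node.js or @huggingface/tokenizers.
const exportsMap = {
  ".": {
    types: "./types/index.d.ts",
    node: {
      require: "./dist/tokenizers.cjs",
      import: "./dist/tokenizers.mjs",
    },
    browser: { import: "./dist/tokenizers.mjs" },
    default: "./dist/tokenizers.mjs",
  },
};

// Walk a target: strings are final; objects are scanned in key order and the
// first key that is "default" or an active condition is followed.
function resolveTarget(target, conditions) {
  if (typeof target === "string") return target;
  if (target === null || typeof target !== "object") return undefined;
  for (const [key, value] of Object.entries(target)) {
    if (key === "default" || conditions.has(key)) {
      const resolved = resolveTarget(value, conditions);
      // A nested branch with no match falls through to the next key.
      if (resolved !== undefined) return resolved;
    }
  }
  return undefined;
}

function resolveExport(map, subpath, conditions) {
  return resolveTarget(map[subpath], new Set(conditions));
}

// Assumed condition sets for a few common consumers.
console.log(resolveExport(exportsMap, ".", ["node", "require"]));
console.log(resolveExport(exportsMap, ".", ["node", "import"]));
console.log(resolveExport(exportsMap, ".", ["browser", "import"]));
console.log(resolveExport(exportsMap, ".", ["types", "node", "import"]));
```

Under these assumptions, a CommonJS `require` in Node lands on the `.cjs` build, ES module imports in Node or a browser bundler land on the `.mjs` build, and a tool that activates the `types` condition picks up the declaration file first because `types` is listed before the runtime conditions.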