### Example tokenizer.json

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md

An example of a tokenizer.json file, illustrating the configuration for a BPE model with specific normalizer, pre-tokenizer, post-processor, and decoder settings.

```json
{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": { "hello": 0, "world": 1 },
    "merges": ["h e", "l l"],
    "unk_token": "<unk>"
  },
  "normalizer": {
    "type": "Lowercase"
  },
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": true
  },
  "post_processor": {
    "type": "BertProcessing",
    "cls": ["[CLS]", 101],
    "sep": ["[SEP]", 102]
  },
  "decoder": {
    "type": "ByteLevel"
  }
}
```

--------------------------------

### Install Tokenizers.js via npm

Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md

Install the library using npm for use in your project.

```bash
npm install @huggingface/tokenizers
```

--------------------------------

### Example tokenizer_config.json

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md

An example of a tokenizer_config.json file, showing configurations for a BERT model including special tokens, model type, and processing flags.

```json
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_type": "bert",
  "name_or_path": "bert-base-uncased",
  "vocab_size": 30522,
  "model_max_length": 512,
  "do_lower_case": true,
  "tokenize_chinese_chars": true,
  "strip_accents": false,
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]",
  "pad_token": "[PAD]",
  "mask_token": "[MASK]",
  "clean_up_tokenization_spaces": true
}
```

--------------------------------

### Example Usage of WordPiece Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to create and use a WordPiece tokenizer with a sample vocabulary and configuration.

```javascript
import { WordPiece } from "@huggingface/tokenizers";

const model = new WordPiece({
  vocab: {
    "hello": 0,
    "world": 1,
    "hel": 2,
    "##lo": 3,
    "<unk>": 100
  },
  unk_token: "<unk>",
  continuing_subword_prefix: "##",
  max_input_chars_per_word: 100
});

const tokens = model(["hello", "world"]);
// ["hello", "world"] with this vocab, because "hello" is a full entry;
// without it, "hello" would fall back to the pieces ["hel", "##lo"]
```
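The "depending on vocab" behavior can be illustrated without the library. The following is a minimal sketch of greedy longest-match-first WordPiece matching, assuming that is how pieces are chosen; `wordPieceSketch` and the two vocab sets are hypothetical names for illustration, not part of the package API.

```javascript
// Greedy longest-match-first WordPiece over a single word.
// `vocab` is a Set of known pieces; all names here are illustrative only.
function wordPieceSketch(word, vocab, { unk = "<unk>", prefix = "##" } = {}) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let match = null;
    // Shrink the candidate from the right until it is found in the vocab.
    while (end > start) {
      const candidate = (start > 0 ? prefix : "") + word.slice(start, end);
      if (vocab.has(candidate)) {
        match = candidate;
        break;
      }
      end--;
    }
    if (match === null) return [unk]; // no piece fits: the whole word is unknown
    pieces.push(match);
    start = end;
  }
  return pieces;
}

const fullVocab = new Set(["hello", "world", "hel", "##lo", "<unk>"]);
const reducedVocab = new Set(["world", "hel", "##lo", "<unk>"]);

console.log(wordPieceSketch("hello", fullVocab));    // ["hello"]
console.log(wordPieceSketch("hello", reducedVocab)); // ["hel", "##lo"]
console.log(wordPieceSketch("xyz", fullVocab));      // ["<unk>"]
```

The sketch omits `max_input_chars_per_word`; a fuller version would return the unknown token for words longer than that limit.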

--------------------------------

### Example Usage of MetaspacePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates how to instantiate and use MetaspacePreTokenizer to tokenize a string, showing the effect of the replacement and prepend_scheme configurations.

```javascript
import { MetaspacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new MetaspacePreTokenizer({
  replacement: "▁",
  prepend_scheme: "always"
});

const tokens = pretokenizer("Hello world");
// ["▁Hello", "▁world"]
```

--------------------------------

### Configure SequencePostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of setting up a SequencePostProcessor that applies BertProcessing followed by ByteLevel processing. This is useful for complex tokenization pipelines.

```javascript
import { SequencePostProcessor } from "@huggingface/tokenizers";

const postprocessor = new SequencePostProcessor({
  processors: [
    {
      type: "BertProcessing",
      cls: ["[CLS]", 101],
      sep: ["[SEP]", 102]
    },
    {
      type: "ByteLevel",
      trim_offsets: true
    }
  ]
});

const result = postprocessor(["hello", "world"], null, true);
```

--------------------------------

### Post-Processor Import Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Example of how to import PostProcessor and specific post-processing implementations from the @huggingface/tokenizers package.

```javascript
import {
  PostProcessor,
  BertProcessingPostProcessor,
  RobertaProcessingPostProcessor
} from "@huggingface/tokenizers";
```

--------------------------------

### Basic Decoder Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Demonstrates the basic usage of a decoder by instantiating `ByteLevelDecoder` and calling it with a list of tokens to produce text.

```javascript
import { ByteLevelDecoder } from "@huggingface/tokenizers";

const decoder = new ByteLevelDecoder({});
const text = decoder(["Hello", "Ġworld"]);
// "Hello world"
```

--------------------------------

### Configure ByteLevelPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of creating a ByteLevelPostProcessor with specific options for adding a prefix space, trimming offsets, and enabling regex splitting.

```javascript
import { ByteLevelPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new ByteLevelPostProcessor({
  add_prefix_space: true,
  trim_offsets: true,
  use_regex: true
});

const result = postprocessor(["Hello", "Ġworld"], null, false);
```

--------------------------------

### Import Pre-Tokenizers

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Demonstrates how to import specific pre-tokenizer classes from the @huggingface/tokenizers package. Ensure the package is installed before use.

```javascript
import {
  PreTokenizer,
  ByteLevelPreTokenizer,
  WhitespacePreTokenizer,
  BertPreTokenizer
} from "@huggingface/tokenizers";
```

--------------------------------

### Tokenization Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/types.md

Demonstrates how to use the tokenizer to encode a single string or a pair of strings, with options for returning token type IDs.

```javascript
const encoding = tokenizer.encode("Hello world");
// {
//   ids: [9906, 4435],
//   tokens: ['Hello', 'Ġworld'],
//   attention_mask: [1, 1]
// }
```

```javascript
const encoding_pair = tokenizer.encode("Hello world", {
  text_pair: "How are you?",
  return_token_type_ids: true
});
// {
//   ids: [101, 9906, 4435, 102, 2129, 2024, 2017, 102],
//   tokens: ['[CLS]', 'Hello', 'Ġworld', '[SEP]', 'How', 'are', 'you', '[SEP]'],
//   attention_mask: [1, 1, 1, 1, 1, 1, 1, 1],
//   token_type_ids: [0, 0, 0, 0, 1, 1, 1, 1]
// }
```

--------------------------------

### Model Import Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Example of how to import Model and its implementations from the @huggingface/tokenizers package.

```javascript
import { Model, BPE, WordPiece, Unigram } from "@huggingface/tokenizers";
```

--------------------------------

### BertProcessingPostProcessor Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Demonstrates how to use BertProcessingPostProcessor for single and paired sequences. Ensure the correct import statement is used.

```javascript
import { BertProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new BertProcessingPostProcessor({
  cls: ["[CLS]", 101],
  sep: ["[SEP]", 102]
});

// Single sequence
const result1 = postprocessor(["hello", "world"], null, true);
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], token_type_ids: [0, 0, 0, 0] }

// Paired sequences
const result2 = postprocessor(["hello"], ["world"], true);
// { tokens: ["[CLS]", "hello", "[SEP]", "world", "[SEP]"], token_type_ids: [0, 0, 0, 1, 1] }
```
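The layout shown in the comments above can be reproduced in a few lines of plain JavaScript. This is only a sketch of the BERT-style template, not the library's implementation; `bertTemplateSketch` is a hypothetical helper name.

```javascript
// Wrap one or two token sequences in the BERT layout:
//   single: [CLS] A [SEP]
//   pair:   [CLS] A [SEP] B [SEP]
// Segment A (with its special tokens) gets type id 0, segment B gets 1.
function bertTemplateSketch(tokensA, tokensB = null, cls = "[CLS]", sep = "[SEP]") {
  let tokens = [cls, ...tokensA, sep];
  let token_type_ids = new Array(tokens.length).fill(0);
  if (tokensB !== null) {
    const second = [...tokensB, sep];
    tokens = tokens.concat(second);
    token_type_ids = token_type_ids.concat(new Array(second.length).fill(1));
  }
  return { tokens, token_type_ids };
}

console.log(bertTemplateSketch(["hello", "world"]));
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], token_type_ids: [0, 0, 0, 0] }
console.log(bertTemplateSketch(["hello"], ["world"]));
// { tokens: ["[CLS]", "hello", "[SEP]", "world", "[SEP]"], token_type_ids: [0, 0, 0, 1, 1] }
```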

--------------------------------

### WhitespacePreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to use WhitespacePreTokenizer to split a string into runs of word characters and runs of punctuation. Whitespace (spaces, tabs, and newlines) separates tokens and is discarded.

```javascript
import { WhitespacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new WhitespacePreTokenizer({});
const tokens = pretokenizer("Hello  world\ttab");
// ["Hello", "world", "tab"]
```

--------------------------------

### Configure and Use RobertaProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example demonstrating how to instantiate and use the RobertaProcessingPostProcessor with custom CLS and SEP tokens, and offset trimming settings. The output includes the processed tokens and their type IDs.

```javascript
import { RobertaProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new RobertaProcessingPostProcessor({
  cls: ["<s>", 0],
  sep: ["</s>", 2],
  trim_offsets: true,
  add_prefix_space: false
});

const result = postprocessor(["hello", "world"], null, true);
// { tokens: ["<s>", "hello", "world", "</s>"], token_type_ids: [0, 0, 0, 0] }
```

--------------------------------

### Import All Main Components (JavaScript)

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Imports all core components, normalizers, pre-tokenizers, models, post-processors, and decoders from the library. Ensure you have the library installed.

```javascript
import {
  // Core
  Tokenizer,
  AddedToken,
  
  // Types
  Encoding,
  
  // Normalizers
  BertNormalizer,
  LowercaseNormalizer,
  NFKCNormalizer,
  
  // Pre-Tokenizers
  ByteLevelPreTokenizer,
  WhitespacePreTokenizer,
  BertPreTokenizer,
  
  // Models
  BPE,
  WordPiece,
  Unigram,
  
  // Post-Processors
  BertProcessingPostProcessor,
  RobertaProcessingPostProcessor,
  
  // Decoders
  ByteLevelDecoder,
  WordPieceDecoder,
  BPEDecoder
} from "@huggingface/tokenizers";
```

--------------------------------

### Use FixedLengthPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of splitting a string into chunks of length 4 using FixedLengthPreTokenizer.

```javascript
import { FixedLengthPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new FixedLengthPreTokenizer({
  length: 4
});

const tokens = pretokenizer("helloworld");
// ["hell", "owor", "ld"]
```
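Fixed-length chunking itself is simple enough to sketch in plain JavaScript. The helper below is hypothetical (`fixedLengthChunks` is not a library export) and counts UTF-16 code units, which may differ from how the library counts characters.

```javascript
// Split `text` into consecutive chunks of at most `length` code units.
// The final chunk is shorter when the text length is not a multiple of `length`.
function fixedLengthChunks(text, length) {
  const chunks = [];
  for (let i = 0; i < text.length; i += length) {
    chunks.push(text.slice(i, i + length));
  }
  return chunks;
}

console.log(fixedLengthChunks("tokenizers", 3)); // ["tok", "eni", "zer", "s"]
```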

--------------------------------

### Create and Use SequenceDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of creating a SequenceDecoder with multiple nested decoders and applying it to tokens. Ensure the necessary decoders are imported.

```javascript
import { SequenceDecoder } from "@huggingface/tokenizers";

const decoder = new SequenceDecoder({
  decoders: [
    { type: "ByteLevel", trim_offsets: true },
    { type: "Fuse" },
    { type: "Strip", content: " ", start: 0, stop: 1 }
  ]
});

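// Byte-level tokens to decode, e.g. the output of tokenizer.tokenize()
const tokens = ["Hello", "Ġworld"];
const text = decoder(tokens);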
```

--------------------------------

### Example Usage of PunctuationPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to create and use PunctuationPreTokenizer to tokenize a string containing punctuation, illustrating the 'Isolated' behavior.

```javascript
import { PunctuationPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new PunctuationPreTokenizer({
  behavior: "Isolated"
});

const tokens = pretokenizer("Hello, world!");
// ["Hello", ",", "world", "!"]
```

--------------------------------

### Use CharDelimiterSplitPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of splitting a string by a comma delimiter using CharDelimiterSplitPreTokenizer.

```javascript
import { CharDelimiterSplitPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new CharDelimiterSplitPreTokenizer({
  delimiter: ","
});

const tokens = pretokenizer("apple,banana,cherry");
// ["apple", "banana", "cherry"]
```

--------------------------------

### ByteLevelPreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates how to use ByteLevelPreTokenizer to tokenize a string. It converts text to byte representations before tokenization, useful for GPT-2 style tokenization.

```javascript
import { ByteLevelPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new ByteLevelPreTokenizer({
  add_prefix_space: true,
  use_regex: true
});

const tokens = pretokenizer("Hello world");
// ['ĠHello', 'Ġworld'] (Ġ is the byte-level stand-in for a space;
// add_prefix_space: true puts one in front of the first word)
```

--------------------------------

### Use ReplacePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Example of replacing a hyphen with a space using ReplacePreTokenizer.

```javascript
import { ReplacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new ReplacePreTokenizer({
  pattern: { String: "-" },
  content: " "
});

const tokens = pretokenizer("hello-world");
// ["hello world"] (the hyphen is replaced; the text is not split)
```

--------------------------------

### PostProcessor Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Demonstrates using a `PostProcessor` component, extending `Callable`, to format token arrays for specific model inputs. It accepts tokens and optional arguments.

```typescript
const processor = new BertProcessingPostProcessor(config);
const result = processor(tokens, null, true);
// Returns { tokens, token_type_ids }
```

--------------------------------

### Example Usage of Unigram Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to create and use a Unigram model with a sample vocabulary of scored pieces and an end-of-sequence token. The model automatically fuses consecutive unknown tokens.

```javascript
import { Unigram } from "@huggingface/tokenizers";

const model = new Unigram({
  vocab: [
    ["<unk>", -10],
    ["hello", 0],
    ["world", 0],
    ["he", -5],
    ["llo", -5]
  ],
  unk_id: 0
}, "</s>");

const tokens = model(["hello", "world"]);
```
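To make the role of the per-piece scores concrete, here is a minimal Viterbi sketch that picks the highest-scoring segmentation of a word. It assumes the scores behave like log-probabilities (higher is better). `unigramSketch` is a hypothetical helper, and it simplifies unknown handling by returning a single unknown token when no segmentation covers the word.

```javascript
// Viterbi search for the highest-scoring segmentation of one word.
// `scores` maps each known piece to its score.
function unigramSketch(word, scores, unk = "<unk>") {
  const n = word.length;
  const best = new Array(n + 1).fill(-Infinity); // best[i]: top score for word.slice(0, i)
  const back = new Array(n + 1).fill(-1);        // back[i]: start index of the last piece
  best[0] = 0;
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const score = scores.get(word.slice(start, end));
      if (score === undefined || best[start] === -Infinity) continue;
      if (best[start] + score > best[end]) {
        best[end] = best[start] + score;
        back[end] = start;
      }
    }
  }
  if (best[n] === -Infinity) return [unk]; // no segmentation covers the word
  const pieces = [];
  for (let end = n; end > 0; end = back[end]) {
    pieces.unshift(word.slice(back[end], end));
  }
  return pieces;
}

const pieceScores = new Map([["hello", 0], ["world", 0], ["he", -5], ["llo", -5]]);
const withoutHello = new Map([["world", 0], ["he", -5], ["llo", -5]]);

console.log(unigramSketch("hello", pieceScores));  // ["hello"]    (score 0 beats -10)
console.log(unigramSketch("hello", withoutHello)); // ["he", "llo"]
console.log(unigramSketch("xyz", pieceScores));    // ["<unk>"]
```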

--------------------------------

### Incorrect Import from Submodule

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

This example shows the incorrect way to import a component by referencing an internal submodule. This method should be avoided.

```javascript
// ❌ Wrong - don't import from submodules
import BertNormalizer from "@huggingface/tokenizers/core/normalizer/BertNormalizer";
```

--------------------------------

### PreTokenizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Shows how a `PreTokenizer` component, extending `Callable`, is used to split input strings into an array of tokens. It accepts a string or an array of strings as input.

```typescript
const pretokenizer = new WhitespacePreTokenizer({});
const tokens = pretokenizer("hello world"); // Returns ["hello", "world"]
```

--------------------------------

### Model Encoding Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Demonstrates how to encode pre-tokenized strings using a model. The output can vary based on the vocabulary and model type, potentially splitting tokens into subwords.

```javascript
const pretokens = ["hello", "world"];
const encoded = model(pretokens);
// encoded might be ["hello", "world"] or
// ["hel", "##lo", "world"] depending on vocab and model type
```

--------------------------------

### AddedToken Behavior Rules

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/AddedToken.md

Explains the behavior rules for AddedToken, focusing on whitespace stripping and normalization, and provides examples.

````APIDOC
## Behavior Rules

**Whitespace Stripping:**
- When `lstrip=true` and token appears after other content, preceding whitespace is trimmed
- When `rstrip=true` and token appears before other content, following whitespace is trimmed
- Useful for preventing spaces around special tokens like "[MASK]"

**Normalization:**
- By default, special tokens are not normalized (`normalized=false`)
- Regular added tokens are normalized by default (`normalized=true`)
- Override with explicit `normalized` value

**Example with stripping:**
```javascript
const mask_token = new AddedToken({
  content: "[MASK]",
  id: 103,
  special: true,
  lstrip: true,
  rstrip: true
});

// With this config, " [MASK] " becomes "[MASK]"
// The surrounding spaces are handled by the tokenizer
```
````

--------------------------------

### BPE Unknown Token Handling Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/errors.md

Demonstrates how the `fuse_unk` flag affects the merging of consecutive unknown tokens. Ensure the `BPE` class is imported.

```javascript
import { BPE } from "@huggingface/tokenizers";

const model = new BPE({
  vocab: { "hello": 0, "<unk>": 100 },
  merges: [],
  unk_token: "<unk>",
  fuse_unk: true,
  byte_fallback: false
});

// Unknown word behavior depends on fuse_unk flag
const result = model(["unknown_word"]);
```

--------------------------------

### TokenizerModel Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Illustrates the usage of a `TokenizerModel` (e.g., BPE) which extends `Callable`. It takes an array of strings and returns an array of encoded tokens.

```typescript
const model = new BPE(config);
const encoded = model(["hello", "world"]); // Returns encoded tokens
```

--------------------------------

### Use FuseDecoder to join tokens

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of instantiating FuseDecoder and using it to concatenate tokens into a single string without spaces.

```javascript
import { FuseDecoder } from "@huggingface/tokenizers";

const decoder = new FuseDecoder({});

const text = decoder(["h", "e", "l", "l", "o"]);
// "hello"
```

--------------------------------

### Use ByteFallbackDecoder for byte-encoded content

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of instantiating ByteFallbackDecoder to process tokens that may include byte-encoded fallback content.

```javascript
import { ByteFallbackDecoder } from "@huggingface/tokenizers";

const decoder = new ByteFallbackDecoder({});

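// Tokens of the form "<0xNN>" stand for raw bytes; other tokens pass through unchanged
const tokens = ["Hello", "<0x21>"]; // 0x21 is the byte for "!"
const text = decoder(tokens);
// "Hello!"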
```
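The idea behind byte fallback can be sketched in plain JavaScript: runs of `<0xNN>` tokens are collected into a byte buffer and decoded together as UTF-8, while ordinary tokens pass through. This is an illustration under that assumption, not the library's code; `byteFallbackSketch` is a hypothetical name.

```javascript
// Decode a token list in which "<0xNN>" tokens stand for raw UTF-8 bytes.
function byteFallbackSketch(tokens) {
  const pieces = [];
  let pending = []; // consecutive byte values waiting to be decoded together
  const flush = () => {
    if (pending.length > 0) {
      pieces.push(new TextDecoder("utf-8").decode(Uint8Array.from(pending)));
      pending = [];
    }
  };
  for (const token of tokens) {
    const m = /^<0x([0-9A-Fa-f]{2})>$/.exec(token);
    if (m) {
      pending.push(parseInt(m[1], 16));
    } else {
      flush();
      pieces.push(token);
    }
  }
  flush();
  return pieces.join("");
}

// 0xC3 0xA9 is the two-byte UTF-8 encoding of "é".
console.log(byteFallbackSketch(["Hi", "<0xC3>", "<0xA9>", "!"])); // "Hié!"
```

Grouping consecutive byte tokens before decoding matters because a multi-byte character is only valid UTF-8 when its bytes are decoded together.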

--------------------------------

### Configure and Use TemplateProcessingPostProcessor with GPT-2 Template

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Example of setting up the TemplateProcessingPostProcessor using a GPT-2 style template. This configuration defines how single and paired sequences are processed, including the use of a custom separator token.

```javascript
import { TemplateProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new TemplateProcessingPostProcessor({
  single: [
    { Sequence: { id: "A" } }
  ],
  pair: [
    { Sequence: { id: "A" } },
    { SpecialToken: { id: "sep", ids: [50256], tokens: ["<|endoftext|>"] } },
    { Sequence: { id: "B" } }
  ],
  special_tokens: {
    sep: {
      id: "sep",
      ids: [50256],
      tokens: ["<|endoftext|>"]
    }
  }
});

const result = postprocessor(["hello"], ["world"], true);
// { tokens: ["hello", "<|endoftext|>", "world"] }
```
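How a template turns into a token list can be sketched as a simple walk over its items. The helper below follows the item shapes used in the configuration above (`Sequence` and `SpecialToken` with a `tokens` array); `applyTemplateSketch` is a hypothetical name and the real post-processor may resolve special tokens differently.

```javascript
// Expand a template: Sequence items are replaced by the matching input
// sequence, SpecialToken items contribute their literal tokens.
function applyTemplateSketch(template, sequences) {
  const tokens = [];
  for (const item of template) {
    if ("Sequence" in item) {
      tokens.push(...sequences[item.Sequence.id]);
    } else if ("SpecialToken" in item) {
      tokens.push(...item.SpecialToken.tokens);
    }
  }
  return tokens;
}

const singleTemplate = [{ Sequence: { id: "A" } }];
const pairTemplate = [
  { Sequence: { id: "A" } },
  { SpecialToken: { id: "sep", ids: [50256], tokens: ["<|endoftext|>"] } },
  { Sequence: { id: "B" } }
];

console.log(applyTemplateSketch(singleTemplate, { A: ["hello"] }));
// ["hello"]
console.log(applyTemplateSketch(pairTemplate, { A: ["hello"], B: ["world"] }));
// ["hello", "<|endoftext|>", "world"]
```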

--------------------------------

### Correct Named Export Import

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

This example demonstrates the correct way to import components using named exports directly from the main package. This is the recommended approach for proper tree-shaking and bundling.

```javascript
// ✓ Correct
import { BertNormalizer } from "@huggingface/tokenizers";
```

--------------------------------

### NFKDNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form KD (Compatibility Decomposition). Use this to replace compatibility characters (such as "ℌ") with their plain equivalents and to split accented characters into a base character plus combining marks.

```javascript
import { NFKDNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFKDNormalizer({});
const result = normalizer("ℌello");
// "Hello" (compatibility characters converted)
```

--------------------------------

### Normalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Demonstrates using a `Normalizer` component, which extends `Callable`, to normalize input strings. The component is invoked directly like a function.

```typescript
const normalizer = new LowercaseNormalizer({});
const normalized = normalizer("Hello"); // Returns "hello"
```

--------------------------------

### NFKCNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form KC (Compatibility Decomposition followed by Canonical Composition). Use this to replace compatibility characters (such as "ℌ") with their plain equivalents while keeping accented characters in precomposed form.

```javascript
import { NFKCNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFKCNormalizer({});
const result = normalizer("ℌello");
// "Hello" (compatibility form)
```
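The two compatibility forms can be compared without the library by using JavaScript's built-in `String.prototype.normalize`. This sketch assumes the normalizers follow the standard Unicode forms; the variable names are illustrative only.

```javascript
const blackLetterH = "\u210C"; // "ℌ" BLACK-LETTER CAPITAL H (a compatibility character)
const ligatureFi = "\uFB01";   // "ﬁ" LATIN SMALL LIGATURE FI
const eAcute = "\u00E9";       // "é" as a single precomposed code point

// Both KC and KD replace compatibility characters with plain equivalents.
console.log((blackLetterH + "ello").normalize("NFKC")); // "Hello"
console.log(ligatureFi.normalize("NFKD"));              // "fi"

// They differ in how accented characters end up:
console.log(eAcute.normalize("NFKD").length); // 2 (base letter + combining accent)
console.log(eAcute.normalize("NFKC").length); // 1 (recomposed into one code point)
```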

--------------------------------

### Tokenizer Encoding with Post-Processing

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Demonstrates how a tokenizer's encode method utilizes post-processors to add special tokens and generate token type IDs. This is a typical integration example.

```javascript
const tokenizer = new Tokenizer(tokenizerJson, config);

// During encoding:
const encoded = tokenizer.encode("Hello world", {
  text_pair: "How are you?",
  add_special_tokens: true,
  return_token_type_ids: true
});

// The post-processor adds special tokens and token type IDs:
// encoded = {
//   ids: [101, 7592, 2088, 102, 2129, 2024, 2017, 102],
//   tokens: ["[CLS]", "Hello", "world", "[SEP]", "How", "are", "you", "[SEP]"],
//   attention_mask: [1, 1, 1, 1, 1, 1, 1, 1],
//   token_type_ids: [0, 0, 0, 0, 1, 1, 1, 1]
// }
```

--------------------------------

### Instantiate StripDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Use StripDecoder to remove a given character from the start and/or end of each token. Configure it with `content` (the character to strip) plus `start` and `stop` (the maximum number of leading and trailing occurrences to remove).

```typescript
new StripDecoder(config: TokenizerConfigDecoderStrip)
```

--------------------------------

### WhitespaceSplitPreTokenizer Usage Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates the usage of WhitespaceSplitPreTokenizer to split a string on whitespace only. Unlike WhitespacePreTokenizer, it leaves punctuation attached to the neighboring word.

```javascript
import { WhitespaceSplitPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new WhitespaceSplitPreTokenizer({});
const tokens = pretokenizer("Hello world");
// ["Hello", "world"]
```

--------------------------------

### Initialize ByteLevelPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Instantiate a ByteLevelPostProcessor with custom configuration. Use this to control prefix space addition, offset trimming, and regex usage for byte-level tokenization.

```typescript
new ByteLevelPostProcessor(config: TokenizerConfigPostProcessorByteLevel)
```

--------------------------------

### Initialize and Use Tokenizer in JavaScript

Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md

Load tokenizer configuration files from the Hugging Face Hub and initialize a Tokenizer instance. This snippet demonstrates tokenizing, encoding, and decoding text.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
```

--------------------------------

### Initialize and Use BertPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Shows how to instantiate and use the BertPreTokenizer, which splits text based on whitespace and punctuation, isolating punctuation as separate tokens. Requires importing the class.

```typescript
import { BertPreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new BertPreTokenizer({});
const tokens = pretokenizer("Hello, world!");
// ["Hello", ",", "world", "!"]
```

--------------------------------

### BPE Model Initialization and Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Initializes a Byte-Pair Encoding model with vocabulary and merge rules. Demonstrates encoding pre-tokenized strings and clearing the internal cache.

```javascript
import { BPE } from "@huggingface/tokenizers";

const model = new BPE({
  vocab: { "hello": 0, "world": 1, "he": 2, "llo": 3 },
  merges: ["h e", "l l"],
  unk_token: "<unk>",
  end_of_word_suffix: "</w>"
});

const tokens = model(["hello", "world"]);
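// Returns the sub-word tokens obtained by applying the merge rules to each word;
// the exact pieces depend on the vocab, the merges, and end_of_word_suffix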

model.clear_cache(); // Free memory
```
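The way merge rules drive BPE can be sketched in plain JavaScript. The helper below is a simplified illustration, not the library's implementation: it ignores `end_of_word_suffix`, vocabulary lookup, and caching, and `bpeSketch` is a hypothetical name.

```javascript
// Apply BPE merges to one word. `merges` is an ordered list of "left right"
// rules; earlier rules have higher priority (lower rank).
function bpeSketch(word, merges) {
  const rank = new Map(merges.map((rule, i) => [rule, i]));
  const symbols = Array.from(word); // start from single characters
  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest rank.
    let bestIdx = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(symbols[i] + " " + symbols[i + 1]);
      if (r !== undefined && r < bestRank) {
        bestRank = r;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break; // no applicable merge is left
    symbols.splice(bestIdx, 2, symbols[bestIdx] + symbols[bestIdx + 1]);
  }
  return symbols;
}

console.log(bpeSketch("lower", ["l o", "lo w", "e r"])); // ["low", "er"]
console.log(bpeSketch("abc", []));                       // ["a", "b", "c"]
```

With no applicable merges the word stays split into characters, which is why a model's merge list and vocabulary need to be consistent with each other.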

--------------------------------

### Initialize SequencePostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Create a SequencePostProcessor to chain multiple post-processing steps. Configure it with an array of processor definitions.

```typescript
new SequencePostProcessor(config: TokenizerConfigPostProcessorSequence)
```

--------------------------------

### Use StripDecoder to remove leading/trailing spaces

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Example of using StripDecoder to remove a trailing space character from each token in a list.

```javascript
import { StripDecoder } from "@huggingface/tokenizers";

const decoder = new StripDecoder({
  content: " ",
  start: 0,
  stop: 1
});

const text = decoder([" ", "hello", " ", "world"]);
// "helloworld" (stop: 1 removes at most one trailing " " per token, so the " " tokens become empty)
```

--------------------------------

### Initialize Tokenizer from Local Files

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Loads tokenizer configuration and vocabulary from local files within a specified directory. Requires Node.js `fs/promises` module.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";
import fs from "fs/promises";

async function initTokenizerLocal(directory) {
  const tokenizerJson = JSON.parse(
    await fs.readFile(`${directory}/tokenizer.json`, "utf-8")
  );
  const tokenizerConfig = JSON.parse(
    await fs.readFile(`${directory}/tokenizer_config.json`, "utf-8")
  );

  return new Tokenizer(tokenizerJson, tokenizerConfig);
}
```

--------------------------------

### NFCNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form C (Canonical Composition). Use this to compose decomposed characters into their precomposed forms.

```javascript
import { NFCNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFCNormalizer({});
const result = normalizer("e\u0301"); // "e" followed by U+0301 COMBINING ACUTE ACCENT
// "\u00e9" ("é" as a single precomposed character)
```

--------------------------------

### NFDNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Applies Unicode Normalization Form D (Canonical Decomposition). Use this to decompose characters into their base and combining diacritic forms.

```javascript
import { NFDNormalizer } from "@huggingface/tokenizers";

const normalizer = new NFDNormalizer({});
const result = normalizer("\u00e9"); // "é" as a single precomposed character
// "e\u0301" ("e" followed by U+0301 COMBINING ACUTE ACCENT)
```
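Because the composed and decomposed forms of "é" look identical on screen, it helps to compare them by code point. The sketch below uses JavaScript's built-in `String.prototype.normalize`, on the assumption that the NFC and NFD normalizers follow the same standard Unicode forms.

```javascript
const composed = "\u00E9";    // "é" as one code point
const decomposed = "e\u0301"; // "e" + COMBINING ACUTE ACCENT

console.log(composed === decomposed);                  // false (different code points)
console.log(decomposed.normalize("NFC") === composed); // true
console.log(composed.normalize("NFD") === decomposed); // true
console.log(composed.length, decomposed.length);       // 1 2
```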

--------------------------------

### ByteLevelDecoder Initialization and Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Initializes a `ByteLevelDecoder` with specific configuration options and demonstrates its use in converting byte-level encoded tokens back to UTF-8 text.

```javascript
import { ByteLevelDecoder } from "@huggingface/tokenizers";

const decoder = new ByteLevelDecoder({
  add_prefix_space: false,
  trim_offsets: false,
  use_regex: true
});

// Assuming tokens contain byte-level encoded text
const text = decoder(["Hello", "Ġworld"]);
// "Hello world"
```
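Where the "Ġ" comes from can be shown with a sketch of the GPT-2 style byte-to-character table, which byte-level tokenizers are commonly built on. The code assumes that scheme; `byteToCharTable` and `byteLevelDecodeSketch` are hypothetical helpers, not library exports.

```javascript
// Map every byte 0..255 to a printable character. Printable Latin-1 bytes map
// to themselves; all other bytes are shifted to code points from 256 upward.
function byteToCharTable() {
  const printable = [];
  for (let b = 33; b <= 126; b++) printable.push(b);  // "!" .. "~"
  for (let b = 161; b <= 172; b++) printable.push(b); // "¡" .. "¬"
  for (let b = 174; b <= 255; b++) printable.push(b); // "®" .. "ÿ"
  const keep = new Set(printable);
  const table = new Map();
  let shifted = 0;
  for (let b = 0; b < 256; b++) {
    const codePoint = keep.has(b) ? b : 256 + shifted++;
    table.set(b, String.fromCharCode(codePoint));
  }
  return table;
}

// Reverse the table, turn the joined tokens back into bytes, decode as UTF-8.
function byteLevelDecodeSketch(tokens) {
  const charToByte = new Map();
  for (const [byte, ch] of byteToCharTable()) charToByte.set(ch, byte);
  const bytes = Array.from(tokens.join(""), (ch) => charToByte.get(ch));
  return new TextDecoder("utf-8").decode(Uint8Array.from(bytes));
}

console.log(byteToCharTable().get(32) === "\u0120");            // true: space maps to "Ġ"
console.log(byteLevelDecodeSketch(["Hello", "\u0120world"]));   // "Hello world"
```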

--------------------------------

### Get Vocabulary Map

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md

Retrieves the complete vocabulary as a Map, mapping token strings to their IDs. Can optionally include user-added tokens.

```typescript
get_vocab(with_added_tokens?: boolean): Map<string, number>
```

```javascript
const vocab = tokenizer.get_vocab();
const vocab_size = vocab.size;

const vocab_no_added = tokenizer.get_vocab(false);
```
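Since the vocabulary comes back as a `Map` from token to id, a reverse lookup table is easy to build. The sketch below uses a small stand-in `Map` (the entries are made up for illustration) in place of a real `tokenizer.get_vocab()` result.

```javascript
// Stand-in for the Map returned by tokenizer.get_vocab(); entries are hypothetical.
const sampleVocab = new Map([
  ["hello", 0],
  ["world", 1],
  ["<unk>", 2]
]);

// Invert token -> id into id -> token for decoding or debugging.
const idToToken = new Map();
for (const [token, id] of sampleVocab) {
  idToToken.set(id, token);
}

console.log(sampleVocab.size);                   // 3
console.log(idToToken.get(1));                   // "world"
console.log(Math.max(...sampleVocab.values()));  // 2 (largest id in use)
```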

--------------------------------

### Initialize RobertaProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Use this post-processor for RoBERTa-style tokenization, which includes specific configurations for CLS and SEP tokens. It handles trimming offsets and adding prefix spaces based on the provided configuration.

```typescript
new RobertaProcessingPostProcessor(config: TokenizerConfigPostProcessorRoberta)
```

--------------------------------

### Load Tokenizer from Hugging Face Hub

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/README.md

Load tokenizer configuration files (tokenizer.json and tokenizer_config.json) from the Hugging Face Hub to initialize a Tokenizer instance.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "HuggingFaceTB/SmolLM3-3B";

const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`
).then(res => res.json());

const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`
).then(res => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```

--------------------------------

### BertNormalizer Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Instantiates and uses BertNormalizer with various configuration options for text cleaning, CJK character spacing, lowercasing, and accent stripping.

```javascript
import { BertNormalizer } from "@huggingface/tokenizers";

const normalizer = new BertNormalizer({
  clean_text: true,
  handle_chinese_chars: true,
  lowercase: true,
  strip_accents: true
});

const result = normalizer("Hello Wörld");
// "hello world" (accents removed, lowercased)
```

--------------------------------

### StripDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Strips a specified substring from the start and/or end of concatenated tokens. It allows for precise control over which parts of the tokenized string are removed.

```APIDOC
## StripDecoder

### Description
Strips the `content` character from the start and/or end of each token before the tokens are concatenated. `start` and `stop` cap how many occurrences are removed from each side.

### Constructor
```typescript
new StripDecoder(config: TokenizerConfigDecoderStrip)
```

### Configuration
| Config | Type | Description |
|--------|------|-------------|
| content | string | Character to strip |
| start | number | Maximum number of occurrences to strip from the start of each token |
| stop | number | Maximum number of occurrences to strip from the end of each token |

### Example
```javascript
import { StripDecoder } from "@huggingface/tokenizers";

const decoder = new StripDecoder({
  content: " ",
  start: 0,
  stop: 1
});

const text = decoder([" ", "hello", " ", "world"]);
// Each token loses at most one trailing " " (stop: 1); nothing is stripped from the start (start: 0)
```
```
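
The stripping rule can be sketched in plain JavaScript. This assumes per-token semantics like those of the `Strip` decoder in the Rust `tokenizers` library and a single-character `content`; `stripToken` is a hypothetical helper, not a library export.

```javascript
// Sketch only: removes up to `start` leading and `stop` trailing copies of
// `content` (assumed to be one character) from a single token.
function stripToken(token, content, start, stop) {
  let from = 0;
  while (from < start && token[from] === content) from++;
  let to = token.length;
  while (token.length - to < stop && to > from && token[to - 1] === content) to--;
  return token.slice(from, to);
}

const pieces = ["▁hello", "▁world"].map((t) => stripToken(t, "▁", 1, 0));
console.log(pieces); // ["hello", "world"]
```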

--------------------------------

### ByteLevelPreTokenizer Initialization

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Initializes the ByteLevelPreTokenizer with configuration options. Use this for BPE models requiring byte-level processing.

```typescript
new ByteLevelPreTokenizer(config: TokenizerConfigPreTokenizerByteLevel)
```
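
Byte-level processing maps every UTF-8 byte to a printable character so that any input can be represented without an unknown token. The sketch below builds the GPT-2 style byte table; it assumes the library follows the same mapping, and `bytesToUnicode` and `toByteLevel` are hypothetical names.

```javascript
// Build the GPT-2 style byte -> printable character table.
function bytesToUnicode() {
  const printable = [];
  for (let b = 0x21; b <= 0x7e; b++) printable.push(b); // "!" to "~"
  for (let b = 0xa1; b <= 0xac; b++) printable.push(b); // "¡" to "¬"
  for (let b = 0xae; b <= 0xff; b++) printable.push(b); // "®" to "ÿ"

  const table = new Map(printable.map((b) => [b, String.fromCharCode(b)]));
  let next = 0;
  for (let b = 0; b < 256; b++) {
    // Remaining bytes (control characters, space, ...) are shifted past U+0100
    if (!table.has(b)) table.set(b, String.fromCharCode(256 + next++));
  }
  return table;
}

const BYTE_TO_CHAR = bytesToUnicode();

function toByteLevel(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(bytes, (b) => BYTE_TO_CHAR.get(b)).join("");
}

console.log(toByteLevel(" hello")); // "Ġhello"
```

The space byte (0x20) lands on "Ġ" (U+0120), which is why byte-level BPE vocabularies are full of tokens starting with that character.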

--------------------------------

### Initialize Tokenizer from Hugging Face Hub

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Loads tokenizer configuration and vocabulary from the Hugging Face Hub using a model ID. Requires an asynchronous context.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

async function initTokenizer(modelId) {
  // Fetch tokenizer files from Hugging Face Hub
  const tokenizerJson = await fetch(
    `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`
  ).then(res => res.json());

  const tokenizerConfig = await fetch(
    `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`
  ).then(res => res.json());

  // Create tokenizer
  return new Tokenizer(tokenizerJson, tokenizerConfig);
}

// Usage
const tokenizer = await initTokenizer("HuggingFaceTB/SmolLM3-3B");
```

--------------------------------

### Decoder Example

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/CoreInterfaces.md

Shows how a `Decoder` component, extending `Callable`, reconstructs text from an array of tokens. It takes the token array and returns the decoded string.

```typescript
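import { ByteLevelDecoder } from "@huggingface/tokenizers";
const tokens = ["Hello", "Ġworld"]; // sample byte-level tokens
const decoder = new ByteLevelDecoder({});
const text = decoder(tokens); // Returns reconstructed text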
```

--------------------------------

### RobertaProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Initializes a RoBERTa-style post-processor. It's similar to BERT's post-processor but with a different special token configuration.

```APIDOC
## RobertaProcessingPostProcessor

### Description
Initializes a RoBERTa-style post-processor, similar to BERT but with a different special token configuration.

### Constructor
`new RobertaProcessingPostProcessor(config: TokenizerConfigPostProcessorRoberta)`

### Parameters
#### Config
- **cls** ([string, number]) - CLS token and its ID.
- **sep** ([string, number]) - SEP token and its ID.
- **trim_offsets** (boolean) - Trim offsets to remove whitespace (default: true).
- **add_prefix_space** (boolean) - Add space prefix to first token (default: false).

### Example
```javascript
import { RobertaProcessingPostProcessor } from "@huggingface/tokenizers";

const postprocessor = new RobertaProcessingPostProcessor({
  cls: ["<s>", 0],
  sep: ["</s>", 2],
  trim_offsets: true,
  add_prefix_space: false
});

const result = postprocessor(["hello", "world"], null, true);
// { tokens: ["<s>", "hello", "world", "</s>"], token_type_ids: [0, 0, 0, 0] }
```
```
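
For sequence pairs, the RoBERTa convention places a doubled separator between the two sequences. The sketch below shows only that token layout; it is not the library's code, and `robertaLayout` is a hypothetical name.

```javascript
// Sketch of the RoBERTa special-token layout (tokens only).
function robertaLayout(tokensA, tokensB = null, cls = "<s>", sep = "</s>") {
  const out = [cls, ...tokensA, sep];
  if (tokensB) out.push(sep, ...tokensB, sep); // doubled separator between sequences
  return out;
}

console.log(robertaLayout(["hello"], ["world"]));
// ["<s>", "hello", "</s>", "</s>", "world", "</s>"]
```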

--------------------------------

### Initialize FixedLengthPreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Instantiate a pre-tokenizer that splits text into fixed-length chunks. Requires a 'length' configuration.

```typescript
new FixedLengthPreTokenizer(config: TokenizerConfigPreTokenizerFixedLength)
```
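
Fixed-length splitting can be illustrated in a few lines. This sketch counts Unicode code points; whether the library counts code points or UTF-16 units is not stated here, so treat that choice as an assumption. `fixedLengthChunks` is a hypothetical name.

```javascript
// Split text into chunks of `length` characters; the last chunk may be shorter.
function fixedLengthChunks(text, length) {
  if (!Number.isInteger(length) || length <= 0) {
    throw new RangeError("length must be a positive integer");
  }
  const chars = Array.from(text); // code points, so surrogate pairs are not split
  const chunks = [];
  for (let i = 0; i < chars.length; i += length) {
    chunks.push(chars.slice(i, i + length).join(""));
  }
  return chunks;
}

console.log(fixedLengthChunks("abcdefg", 3)); // ["abc", "def", "g"]
```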

--------------------------------

### Get Added Tokens Decoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md

Returns a Map from added-token IDs to their `AddedToken` objects, which expose properties such as `content`.

```typescript
get_added_tokens_decoder(): Map<number, AddedToken>
```

```javascript
const added_tokens = tokenizer.get_added_tokens_decoder();
for (const [id, token] of added_tokens) {
  console.log(`ID ${id}: ${token.content}`);
}
```

--------------------------------

### Initialize TemplateProcessingPostProcessor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

This post-processor offers maximum flexibility using a template syntax for arranging tokens. It supports configurations for single and paired sequences, along with custom special tokens.

```typescript
new TemplateProcessingPostProcessor(config: TokenizerConfigPostProcessorTemplateProcessing)
```
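
A template is an ordered list of items, each either a special token or a placeholder for sequence A or B. The sketch below expands such a list using the item shape found in `tokenizer.json` files. It simplifies one detail: the special token's `id` is used directly as the token text rather than looked up in a `special_tokens` map. `applyTemplate` is a hypothetical name.

```javascript
// Template items:
//   { SpecialToken: { id, type_id } } or { Sequence: { id: "A" | "B", type_id } }
function applyTemplate(template, tokensA, tokensB = []) {
  const tokens = [];
  const tokenTypeIds = [];
  for (const item of template) {
    const isSpecial = "SpecialToken" in item;
    const { id, type_id } = isSpecial ? item.SpecialToken : item.Sequence;
    const piece = isSpecial ? [id] : id === "A" ? tokensA : tokensB;
    tokens.push(...piece);
    tokenTypeIds.push(...piece.map(() => type_id));
  }
  return { tokens, tokenTypeIds };
}

const single = [
  { SpecialToken: { id: "[CLS]", type_id: 0 } },
  { Sequence: { id: "A", type_id: 0 } },
  { SpecialToken: { id: "[SEP]", type_id: 0 } },
];

console.log(applyTemplate(single, ["hello", "world"]));
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], tokenTypeIds: [0, 0, 0, 0] }
```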

--------------------------------

### Normalizer Base Class Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Normalizers.md

Demonstrates how to instantiate a normalizer (e.g., BertNormalizer) and call it to normalize text.

```javascript
import { BertNormalizer } from "@huggingface/tokenizers";

const normalizer = new BertNormalizer({ lowercase: true, strip_accents: true });
const normalized = normalizer("Hello  WORLD");
```

--------------------------------

### Load Tokenizers.js via CDN

Source: https://github.com/huggingface/tokenizers.js/blob/main/README.md

Include the library in your HTML using a CDN for browser-based applications.

```html
<script type="module">
  import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>
```

--------------------------------

### Get Complete Vocabulary

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Retrieve the tokenizer's entire vocabulary, including any added special tokens. `get_vocab()` returns a Map of token to ID, so `size` gives the total token count and `Array.from` converts it to `[token, id]` pairs for iteration.

```javascript
// Get all tokens with their IDs
const vocab = tokenizer.get_vocab();
console.log(vocab.size); // Total vocabulary size

// Convert to array for iteration
const vocabArray = Array.from(vocab).slice(0, 10);
console.log(vocabArray);
// [["hello", 0], ["world", 1], ...]
```
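
Because the vocabulary maps tokens to IDs, looking up a token by ID requires inverting it. A minimal sketch follows, using a plain Map as a stand-in for the value returned by `get_vocab()`; the sample entries are made up for illustration.

```javascript
// Stand-in for the Map returned by tokenizer.get_vocab() (token -> ID).
const sampleVocab = new Map([["hello", 0], ["world", 1], ["[UNK]", 2]]);

// Invert it for ID -> token lookups.
const idToToken = new Map(Array.from(sampleVocab, ([token, id]) => [id, token]));

console.log(idToToken.get(1)); // "world"
```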

--------------------------------

### Tokenizer Constructor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md

Creates a new Tokenizer instance that orchestrates the tokenization pipeline. It takes tokenizer configuration and additional settings as input.

```APIDOC
## Constructor Tokenizer

### Description
Creates a new Tokenizer instance that orchestrates the tokenization pipeline. It takes tokenizer configuration and additional settings as input.

### Method
```typescript
new Tokenizer(tokenizer: Object, config: Object)
```

### Parameters
#### Constructor Parameters
- **tokenizer** (Object) - Required - Tokenizer configuration from `tokenizer.json`. Must include properties: `model`, `decoder`, `post_processor`, `pre_tokenizer`, `normalizer`
- **config** (Object) - Required - Additional configuration from `tokenizer_config.json`. Can include special tokens, processing flags, and custom settings

### Throws
- Error if tokenizer object is missing required properties
- Error if config object is invalid or missing

### Example
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`)
  .then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`)
  .then((res) => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```
```
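
Since the constructor throws when required properties are missing, a pre-flight check can give a clearer message. The sketch below tests only for key presence, using the property list given above; `missingTokenizerKeys` is a hypothetical helper, not a library function.

```javascript
const REQUIRED_KEYS = ["model", "decoder", "post_processor", "pre_tokenizer", "normalizer"];

// Hypothetical pre-flight check: returns the names of any missing properties.
function missingTokenizerKeys(tokenizerJson) {
  return REQUIRED_KEYS.filter((key) => !(key in tokenizerJson));
}

console.log(missingTokenizerKeys({ model: { type: "BPE" }, normalizer: null }));
// ["decoder", "post_processor", "pre_tokenizer"]
```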

--------------------------------

### Get Vocabulary Without Added Tokens

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Retrieve only the original model vocabulary, excluding any user-added special tokens. This is useful when you need to work with the base vocabulary of the pre-trained model.

```javascript
// Original model vocabulary only
const modelVocab = tokenizer.get_vocab(false);
// Excludes user-added special tokens
```

--------------------------------

### SequencePreTokenizer Constructor

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Initializes a new SequencePreTokenizer by chaining multiple pre-tokenizers. The pre-tokenizers are applied in the order they are provided in the configuration array.

```APIDOC
## SequencePreTokenizer

### Description
Chains multiple pre-tokenizers together.

### Constructor
`new SequencePreTokenizer(config: TokenizerConfigPreTokenizerSequence)`

### Parameters
#### `config` (TokenizerConfigPreTokenizerSequence)
- **pretokenizers** (TokenizerConfigPreTokenizer[]) - Required - Array of pre-tokenizer configs to apply in order.

### Example
```javascript
import { SequencePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new SequencePreTokenizer({
  pretokenizers: [
    { type: "ByteLevel", add_prefix_space: true },
    { type: "Punctuation", behavior: "Isolated" }
  ]
});

const tokens = pretokenizer("Hello, world!");
```
```
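
Chaining means each pre-tokenizer receives the pieces produced by the previous one. The sketch below shows the idea with two toy splitters; these are simplified stand-ins and not the library's `ByteLevel` or `Punctuation` implementations.

```javascript
// Each toy pre-tokenizer maps one string to an array of pieces.
const splitOnWhitespace = (text) => text.split(/\s+/).filter(Boolean);
const isolatePunctuation = (text) => text.match(/\p{P}|[^\p{P}]+/gu) ?? [];

// Apply the steps in order, feeding every piece into the next step.
function chain(steps) {
  return (text) => steps.reduce((pieces, step) => pieces.flatMap((p) => step(p)), [text]);
}

const pretokenize = chain([splitOnWhitespace, isolatePunctuation]);
console.log(pretokenize("Hello, world!")); // ["Hello", ",", "world", "!"]
```

Order matters: swapping the two steps here happens to give the same pieces, but with other splitters it generally would not.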

--------------------------------

### Initialize CTCDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Create a CTCDecoder for speech or audio models. Configure padding tokens, word delimiters, and cleanup behavior.

```typescript
new CTCDecoder(config: TokenizerConfigDecoderCTC)
```
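
CTC decoding conventionally merges repeated tokens, drops the padding (blank) token, and turns the word delimiter into a space. The sketch below shows those three steps; the default `"<pad>"` and `"|"` values are assumptions for illustration, and `ctcCollapse` is a hypothetical name.

```javascript
// Sketch of the standard CTC collapse (not the library's code).
function ctcCollapse(tokens, padToken = "<pad>", wordDelimiter = "|") {
  const kept = tokens
    .filter((tok, i) => i === 0 || tok !== tokens[i - 1]) // merge consecutive repeats
    .filter((tok) => tok !== padToken);                   // drop blank tokens
  return kept.join("").split(wordDelimiter).join(" ").trim();
}

const frames = ["h", "e", "l", "<pad>", "l", "o", "o", "|", "w", "o", "r", "l", "d"];
console.log(ctcCollapse(frames)); // "hello world"
```

The padding token between the two `l` frames is what keeps the double letter from being merged.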

--------------------------------

### Initialize ReplacePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Instantiate a pre-tokenizer that replaces patterns with specified content before splitting. Requires 'pattern' and 'content' configuration.

```typescript
new ReplacePreTokenizer(config: TokenizerConfigPreTokenizerReplace)
```

--------------------------------

### Access and Inspect Tokenizer Components

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/usage-patterns.md

Provides access to individual components of a tokenizer, such as the normalizer, pre-tokenizer, model, post-processor, and decoder. Includes examples of testing these components and checking model vocabulary statistics.

```javascript
// Access individual components
const normalizer = tokenizer.normalizer;
const preTokenizer = tokenizer.pre_tokenizer;
const model = tokenizer.model;
const postProcessor = tokenizer.post_processor;
const decoder = tokenizer.decoder;

// Test individual components
if (normalizer) {
  const normalized = normalizer("Hello WORLD");
  console.log(normalized);
}

if (preTokenizer) {
  const pretokens = preTokenizer("hello world");
  console.log(pretokens);
}

if (model) {
  const encoded = model(["hello", "world"]);
  console.log(encoded);
}

// Check model vocabulary stats
if (model) {
  console.log(`Vocabulary size: ${model.vocab.length}`);
  console.log(`Unknown token: ${model.unk_token}`);
  console.log(`Unknown token ID: ${model.unk_token_id}`);
}
```

--------------------------------

### BertProcessingPostProcessor Usage

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PostProcessors.md

Illustrates the basic usage of BertProcessingPostProcessor without explicit imports, suitable for environments where it's already available.

```javascript
const postprocessor = new BertProcessingPostProcessor({
  sep: ["[SEP]", 102],
  cls: ["[CLS]", 101]
});

const result = postprocessor(["hello", "world"], null, true);
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], token_type_ids: [0, 0, 0, 0] }
```
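
For a sequence pair, the BERT convention assigns type ID 0 to the first segment (including `[CLS]` and the first `[SEP]`) and 1 to the second. The sketch below illustrates that layout in plain JavaScript; it is not the library's implementation, and `bertLayout` is a hypothetical name.

```javascript
// Sketch of the BERT special-token layout with token type IDs.
function bertLayout(tokensA, tokensB = null, cls = "[CLS]", sep = "[SEP]") {
  const tokens = [cls, ...tokensA, sep];
  const tokenTypeIds = tokens.map(() => 0);
  if (tokensB) {
    tokens.push(...tokensB, sep);
    tokenTypeIds.push(...tokensB.map(() => 1), 1); // second segment and its [SEP]
  }
  return { tokens, tokenTypeIds };
}

console.log(bertLayout(["hello"], ["world"]));
// { tokens: ["[CLS]", "hello", "[SEP]", "world", "[SEP]"], tokenTypeIds: [0, 0, 0, 1, 1] }
```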

--------------------------------

### Instantiate SequencePreTokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Instantiates a SequencePreTokenizer with an array of pre-tokenizer configurations. This is useful for creating custom tokenization pipelines by combining different pre-tokenization strategies.

```typescript
new SequencePreTokenizer(config: TokenizerConfigPreTokenizerSequence)
```

```javascript
import { SequencePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new SequencePreTokenizer({
  pretokenizers: [
    { type: "ByteLevel", add_prefix_space: true },
    { type: "Punctuation", behavior: "Isolated" }
  ]
});

const tokens = pretokenizer("Hello, world!");
```

--------------------------------

### Call PreTokenizer as a Function

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/PreTokenizers.md

Demonstrates calling a pre-tokenizer instance directly as a function to tokenize a string. This is the primary way to use pre-tokenizers after instantiation.

```javascript
import { WhitespacePreTokenizer } from "@huggingface/tokenizers";

const pretokenizer = new WhitespacePreTokenizer({});
const tokens = pretokenizer("Hello world");
// ["Hello", "world"]
```

--------------------------------

### Tokenization Pipeline Overview

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/INDEX.md

Illustrates the complete flow of text processing within the tokenizers.js library, from input text to final output and reconstruction.

```text
Input Text
    ↓
[Normalizer] ──→ Normalized text
    ↓
[PreTokenizer] ──→ Pre-tokens
    ↓
[Model] ──→ Sub-tokens
    ↓
[PostProcessor] ──→ Final tokens + token type IDs
    ↓
Output (IDs, tokens, attention mask)
    ↓
[Decoder] ──→ Reconstructed text
```
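
The encoding half of the diagram can be mimicked with toy functions to show how the stages compose. Every function and the vocabulary below are made-up stand-ins, not library components.

```javascript
const toyVocab = new Map([["[CLS]", 0], ["[SEP]", 1], ["[UNK]", 2], ["hello", 3], ["world", 4]]);

const normalize = (text) => text.toLowerCase().trim();                  // Normalizer
const preTokenize = (text) => text.split(/\s+/).filter(Boolean);        // PreTokenizer
const model = (words) => words.map((w) => (toyVocab.has(w) ? w : "[UNK]")); // Model
const postProcess = (tokens) => ["[CLS]", ...tokens, "[SEP]"];          // PostProcessor

function toyEncode(text) {
  const tokens = postProcess(model(preTokenize(normalize(text))));
  return {
    tokens,
    ids: tokens.map((t) => toyVocab.get(t)),
    attention_mask: tokens.map(() => 1),
  };
}

console.log(toyEncode("Hello WORLD"));
// { tokens: ["[CLS]", "hello", "world", "[SEP]"], ids: [0, 3, 4, 1], attention_mask: [1, 1, 1, 1] }
```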

--------------------------------

### Initialize WordPieceDecoder

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Instantiate a WordPieceDecoder with custom prefix and cleanup options. The prefix is used to identify continuation tokens, and cleanup removes extra spaces.

```typescript
new WordPieceDecoder(config: TokenizerConfigDecoderWordPiece)
```
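
The role of the prefix is easiest to see in a small sketch: tokens that start with it are glued to the previous token, and all others start a new word. This is a simplified illustration without the cleanup step, and `wordPieceDecode` is a hypothetical name.

```javascript
// Join WordPiece tokens: "##" pieces continue the previous word.
function wordPieceDecode(tokens, prefix = "##") {
  return tokens
    .map((tok, i) => {
      if (tok.startsWith(prefix)) return tok.slice(prefix.length);
      return i === 0 ? tok : " " + tok;
    })
    .join("");
}

console.log(wordPieceDecode(["un", "##believ", "##able", "story"]));
// "unbelievable story"
```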

--------------------------------

### Tokenizer Decoding with Integrated Decoders

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Decoders.md

Demonstrates how the Tokenizer automatically uses its configured decoders during the decoding process. Special tokens can be skipped and tokenization spaces can be cleaned up using options.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// tokenizerJson and config are assumed to be loaded already (e.g. fetched from the Hub)
const tokenizer = new Tokenizer(tokenizerJson, config);

const encoded = tokenizer.encode("Hello world");
const decoded = tokenizer.decode(encoded.ids);
// Uses the decoder from tokenizer.json automatically
// "Hello world"

// With options:
const decoded_clean = tokenizer.decode(encoded.ids, {
  skip_special_tokens: true,
  clean_up_tokenization_spaces: true
});
```

--------------------------------

### Initialize WordPiece Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Instantiate a WordPiece tokenizer with vocabulary and configuration. Words longer than `max_input_chars_per_word` are mapped to the unknown token, and non-initial subword pieces carry the `continuing_subword_prefix`.

```typescript
new WordPiece(config: TokenizerConfigWordPieceModel)
```
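
WordPiece is conventionally implemented as greedy longest-match-first segmentation. The sketch below shows that algorithm on a toy vocabulary; the option names and the default of 100 characters are assumptions for illustration, and `wordPiece` is a hypothetical function rather than the library's class.

```javascript
// Greedy longest-match-first WordPiece segmentation of a single word.
function wordPiece(word, vocab, { unk = "[UNK]", prefix = "##", maxChars = 100 } = {}) {
  if (word.length > maxChars) return [unk];
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let match = null;
    while (start < end) {
      // Non-initial pieces are looked up with the continuation prefix
      const candidate = (start > 0 ? prefix : "") + word.slice(start, end);
      if (vocab.has(candidate)) {
        match = candidate;
        break;
      }
      end--;
    }
    if (match === null) return [unk]; // no piece fits: the whole word is unknown
    pieces.push(match);
    start = end;
  }
  return pieces;
}

const wpVocab = new Set(["un", "##aff", "##able", "[UNK]"]);
console.log(wordPiece("unaffable", wpVocab)); // ["un", "##aff", "##able"]
console.log(wordPiece("xyz", wpVocab));       // ["[UNK]"]
```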

--------------------------------

### Instantiate Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Tokenizer.md

Create a new Tokenizer instance by providing tokenizer and configuration JSON objects. This is useful when loading a tokenizer from a model ID.

```typescript
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`)
  .then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`)
  .then((res) => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```

--------------------------------

### Handle Unknown Component Type During Instantiation

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/errors.md

When creating a tokenizer, ensure all component types (model, normalizer, pre-tokenizer, post-processor, decoder) specified in `tokenizerJson` are supported. Use valid types to prevent instantiation errors.

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

const tokenizerJson = {
  model: { type: "BPE", vocab: {}, merges: [] },
  normalizer: { type: "InvalidNormalizer" } // Unsupported type, so the constructor throws
};

try {
  const tokenizer = new Tokenizer(tokenizerJson, {});
} catch (error) {
  console.error("Failed to create normalizer:", error);
}
```

--------------------------------

### Initialize Unigram Tokenizer

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/api-reference/Models.md

Instantiate a Unigram tokenizer with model configuration and an end-of-sequence token. It uses the Viterbi algorithm to find the most probable tokenization.

```typescript
new Unigram(config: TokenizerConfigUnigramModel, eos_token: string)
```
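
The Viterbi search can be sketched as dynamic programming over end positions, keeping the best-scoring segmentation that ends at each one. The toy log-probabilities below are made up, and the sketch returns `null` when no segmentation exists instead of falling back to an unknown token. `viterbiSegment` is a hypothetical name.

```javascript
// Find the segmentation of `text` with the highest total log-probability.
function viterbiSegment(text, logProbs) {
  const n = text.length;
  const best = new Array(n + 1).fill(-Infinity); // best score ending at each index
  const backStart = new Array(n + 1).fill(-1);   // start index of the last piece
  best[0] = 0;
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const piece = text.slice(start, end);
      if (!logProbs.has(piece) || best[start] === -Infinity) continue;
      const score = best[start] + logProbs.get(piece);
      if (score > best[end]) {
        best[end] = score;
        backStart[end] = start;
      }
    }
  }
  if (best[n] === -Infinity) return null; // vocabulary cannot cover the text
  const pieces = [];
  for (let end = n; end > 0; end = backStart[end]) {
    pieces.unshift(text.slice(backStart[end], end));
  }
  return pieces;
}

const pieceScores = new Map([
  ["hello", -1], ["hell", -2], ["o", -2], ["h", -3], ["e", -3], ["l", -3],
]);
console.log(viterbiSegment("hellohell", pieceScores)); // ["hello", "hell"]
```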

--------------------------------

### SentencePiece Unigram Tokenizer Configuration

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/configuration.md

Set up a SentencePiece Unigram tokenizer using a precompiled character map, a Metaspace pre-tokenizer, and Unigram model specifics like vocab and unknown ID.

```json
{
  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      { "type": "Precompiled", "precompiled_charsmap": "..." }
    ]
  },
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "▁",
    "prepend_scheme": "first"
  },
  "model": {
    "type": "Unigram",
    "vocab": [["..."]],
    "unk_id": 0
  },
  "decoder": {
    "type": "Metaspace",
    "replacement": "▁"
  }
}
```
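
The Metaspace step in this configuration replaces spaces with the `▁` marker so that word boundaries survive tokenization. The sketch below is a simplified illustration that always prepends the marker to the first word; it is not the library's implementation, and `metaspaceSplit` is a hypothetical name.

```javascript
const MARK = "\u2581"; // "▁", the replacement character from the config above

// Replace spaces with the marker, prepend it to the first word, then split
// so that every piece starts with the marker.
function metaspaceSplit(text) {
  let marked = text.split(" ").join(MARK);
  if (!marked.startsWith(MARK)) marked = MARK + marked;
  return marked.split(/(?=\u2581)/);
}

console.log(metaspaceSplit("Hello world")); // ["▁Hello", "▁world"]
```

Decoding reverses the step by turning each marker back into a space and trimming the leading one.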

--------------------------------

### Package Exports Configuration

Source: https://github.com/huggingface/tokenizers.js/blob/main/_autodocs/exports.md

Defines the main entry points for the package, specifying how different module systems (CommonJS, ES Modules) and environments (Node.js, browser) should import the library. This configuration is crucial for package managers and bundlers.

```json
{
  "exports": {
    ".": {
      "types": "./types/index.d.ts",
      "node": {
        "require": "./dist/tokenizers.cjs",
        "import": "./dist/tokenizers.mjs"
      },
      "browser": {
        "import": "./dist/tokenizers.mjs"
      },
      "default": "./dist/tokenizers.mjs"
    }
  }
}
```
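
Conditional exports are resolved by trying keys in object order and taking the first one that matches an active condition. The sketch below is a simplified model of that lookup applied to the map above; it is not Node's actual resolver, and `resolveExport` is a hypothetical name.

```javascript
// Simplified model of conditional "exports" resolution.
function resolveExport(target, conditions) {
  if (typeof target === "string") return target;
  for (const [key, value] of Object.entries(target)) {
    if (key === "default" || conditions.has(key)) {
      const resolved = resolveExport(value, conditions);
      if (resolved !== null) return resolved; // otherwise keep trying later keys
    }
  }
  return null;
}

const entry = {
  types: "./types/index.d.ts",
  node: { require: "./dist/tokenizers.cjs", import: "./dist/tokenizers.mjs" },
  browser: { import: "./dist/tokenizers.mjs" },
  default: "./dist/tokenizers.mjs",
};

console.log(resolveExport(entry, new Set(["node", "require"])));   // "./dist/tokenizers.cjs"
console.log(resolveExport(entry, new Set(["browser", "import"]))); // "./dist/tokenizers.mjs"
```

Under this model, only a CommonJS `require` in Node.js receives the `.cjs` build; every other combination falls through to the ES module.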