### Compile llama.cpp Binary for Wllama (Shell)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Provides shell commands to compile the llama.cpp binary required by Wllama from its source code. This process involves cloning the repository, optionally updating the llama.cpp submodule, installing npm dependencies, building the WebAssembly module, and finally building the ES module.

```shell
# Clone the repository with submodule
git clone --recurse-submodules https://github.com/ngxson/wllama.git
cd wllama

# Optionally, update llama.cpp to the latest upstream version (bleeding edge, use at your own risk!)
# git submodule update --remote --merge

# Install the required modules
npm i

# First, build llama.cpp into wasm
npm run build:wasm
# Then build the ES module
npm run build
```

--------------------------------

### Install wllama in React TypeScript Project

Source: https://github.com/ngxson/wllama/blob/master/README.md

Install the wllama package using npm in your React TypeScript project. This command adds the necessary library to your project's dependencies, allowing you to import and use the Wllama module.

```bash
npm i @wllama/wllama
```

--------------------------------

### Get Logits for Custom Sampling with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This snippet demonstrates how to retrieve the most likely next tokens and their probabilities from the Wllama model (the API calls these logits) in order to implement custom sampling strategies. It tokenizes the input, initializes sampling, accepts the prompt tokens, decodes them, reads the top candidates, and then selects a token from them.

```javascript
const tokens = await wllama.tokenize("The color of the sky is", true);
await wllama.kvClear();
await wllama.samplingInit({ temp: 1.0 });
await wllama.samplingAccept(tokens);
await wllama.decode(tokens, { skipLogits: false });

// Get top 10 most likely tokens
const logits = await wllama.getLogits(10);

console.log('Top 10 tokens:');
for (const { token, p } of logits) {
  const text = await wllama.detokenize([token], true);
  console.log(`Token ${token} (${text}): ${(p * 100).toFixed(2)}%`);
}

// Custom sampling based on logits
const selectedToken = logits[0].token; // Pick highest probability
await wllama.samplingAccept([selectedToken]);
await wllama.decode([selectedToken], {});
```
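The last step above always picks the most likely token. As a minimal sketch of a different strategy, the helper below draws a token at random in proportion to its probability. It assumes only the `{ token, p }` entry shape shown above; `sampleFromTopK` and its injectable `rng` parameter are hypothetical names for illustration, not part of the wllama API.

```javascript
// Hypothetical helper (not a wllama API): weighted random choice over entries
// shaped like the { token, p } objects read from getLogits() above.
function sampleFromTopK(candidates, rng = Math.random) {
  if (candidates.length === 0) {
    throw new Error('No candidates to sample from');
  }
  // Top-K probabilities need not sum to 1, so normalize over the subset.
  const total = candidates.reduce((sum, c) => sum + c.p, 0);
  let threshold = rng() * total;
  for (const c of candidates) {
    threshold -= c.p;
    if (threshold < 0) return c.token;
  }
  // Floating-point rounding can leave a tiny remainder; use the last entry.
  return candidates[candidates.length - 1].token;
}

// Example with made-up token IDs and probabilities
const demoCandidates = [
  { token: 11, p: 0.5 },
  { token: 22, p: 0.3 },
  { token: 33, p: 0.1 },
];
console.log('Sampled token:', sampleFromTopK(demoCandidates));
```

The returned token ID could then be passed to `samplingAccept` and `decode` in the same way as `selectedToken` above.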

--------------------------------

### Import and Initialize Wllama in React TypeScript

Source: https://github.com/ngxson/wllama/blob/master/README.md

Import the Wllama class from the installed package and create a new instance. This involves providing configuration paths for the WebAssembly modules. The Wllama instance can then be used to load models and perform inference.

```typescript
import { Wllama } from '@wllama/wllama';
let wllamaInstance = new Wllama(WLLAMA_CONFIG_PATHS, ...);
// (the rest is the same as in the earlier example)
```

--------------------------------

### Format Chat Messages with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This example shows how to format chat messages according to a model's specific chat template using the Wllama API. It includes using the model's built-in template, applying a custom Jinja2 template, and retrieving the model's default chat template.

```javascript
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Hello! Can you help me?' },
  { role: 'assistant', content: 'Of course! What do you need?' },
  { role: 'user', content: 'Tell me about AI.' },
];

// Use model's built-in template
const formatted = await wllama.formatChat(messages, true);
console.log('Formatted chat:\n', formatted);

// Use custom template (Jinja2 format)
const customTemplate = `
{%- for message in messages %}
{{ '<|' + message.role + '|>\n' + message.content + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{ '<|assistant|>\n' }}
{%- endif %}
`;

const customFormatted = await wllama.formatChat(
  messages,
  true,
  customTemplate
);
console.log('Custom formatted:\n', customFormatted);

// Get model's chat template
const template = wllama.getChatTemplate();
if (template) {
  console.log('Model chat template:', template);
}
```
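To make the layout that the custom template is aiming for easier to see, here is a plain JavaScript approximation. It is not a Jinja renderer and ignores Jinja whitespace-control details, so the exact newlines produced by `formatChat` may differ. `renderSimpleChat` is a hypothetical helper, not part of wllama.

```javascript
// Hypothetical helper: builds the "<|role|>" layout by hand.
// This only approximates the custom template above; it does not parse Jinja.
function renderSimpleChat(messages, addGenerationPrompt = true) {
  let out = '';
  for (const { role, content } of messages) {
    out += `<|${role}|>\n${content}\n`;
  }
  if (addGenerationPrompt) {
    // Leave an open assistant turn for the model to complete
    out += '<|assistant|>\n';
  }
  return out;
}

console.log(renderSimpleChat([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Tell me about AI.' },
]));
```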

--------------------------------

### Manage OPFS Cache Directly with CacheManager

Source: https://context7.com/ngxson/wllama/llms.txt

Provides direct control over the OPFS cache for advanced use cases and manual file management. The cache manager supports listing files, opening them, reading their size and metadata, deleting one or many files, and clearing the entire cache. Requires the wllama library.

```javascript
const cacheManager = wllama.cacheManager;

// List all cached files
const entries = await cacheManager.list();
for (const entry of entries) {
  console.log('File:', entry.name);
  console.log('Size:', entry.size);
  console.log('ETag:', entry.metadata.etag);
  console.log('Original URL:', entry.metadata.originalURL);
  console.log('Original size:', entry.metadata.originalSize);
}

// Get file name from URL
const fileName = await cacheManager.getNameFromURL(
  'https://example.com/model.gguf'
);
console.log('Cache file name:', fileName);

// Open cached file
const blob = await cacheManager.open(fileName);
if (blob) {
  console.log('File size:', blob.size);
  const arrayBuffer = await blob.arrayBuffer();
  console.log('First 4 bytes:', new Uint8Array(arrayBuffer, 0, 4));
}

// Get file size
const size = await cacheManager.getSize(fileName);
console.log('Size in cache:', size);

// Get metadata
const metadata = await cacheManager.getMetadata(fileName);
if (metadata) {
  console.log('Metadata:', metadata);
}

// Delete specific file
await cacheManager.delete('https://example.com/model.gguf');

// Delete multiple files
await cacheManager.deleteMany((entry) => entry.size > 1000000000);

// Clear entire cache
await cacheManager.clear();
```
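The `deleteMany` call above takes a predicate over cache entries. As a sketch of a slightly richer policy, the helper below builds a predicate that evicts large files while sparing a list of pinned URLs. It assumes the entry shape printed in the listing loop above (`size`, `metadata.originalURL`); `makeEvictionPredicate`, `maxBytes`, and `keepURLs` are hypothetical names.

```javascript
// Hypothetical helper: returns a predicate suitable for filtering cache entries
// shaped like { name, size, metadata: { originalURL } }.
function makeEvictionPredicate({ maxBytes, keepURLs = [] }) {
  const keep = new Set(keepURLs);
  return (entry) => {
    const url = entry.metadata ? entry.metadata.originalURL : undefined;
    if (keep.has(url)) return false; // pinned entries are never evicted
    return entry.size > maxBytes;
  };
}

// Example with made-up entries
const demoEntries = [
  { name: 'a', size: 2e9, metadata: { originalURL: 'https://example.com/big.gguf' } },
  { name: 'b', size: 2e9, metadata: { originalURL: 'https://example.com/pinned.gguf' } },
  { name: 'c', size: 5e6, metadata: { originalURL: 'https://example.com/small.gguf' } },
];
const shouldEvict = makeEvictionPredicate({
  maxBytes: 1e9,
  keepURLs: ['https://example.com/pinned.gguf'],
});
console.log(demoEntries.filter(shouldEvict).map((e) => e.name));
```

A predicate built this way could be passed to `cacheManager.deleteMany(...)` in place of the inline arrow function.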

--------------------------------

### Wllama Constructor with Configuration Options

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Demonstrates passing the optional second parameter, `WllamaConfig`, to the `Wllama` constructor. This sets configuration options such as `parallelDownloads` and `allowOffline` at instantiation.

```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5, // maximum concurrent downloads
  allowOffline: false, // whether to allow offline model loading
});
```

--------------------------------

### TypeScript: Initialize Wllama Application and UI

Source: https://github.com/ngxson/wllama/blob/master/examples/basic/index.html

This snippet sets up the main execution flow for the Wllama example application. It looks up the UI elements, attaches event handlers to the buttons, and defines utility functions that enable or disable UI controls based on the application state. The handlers call `startCompletions` and `startEmbeddings`, which are defined elsewhere in the example and are not shown in this snippet.

```TypeScript
import { Wllama } from '../../esm/index.js';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '../../esm/multi-thread/wllama.wasm',
};

const CMPL_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf';
const CMPL_MODEL_SIZE = '19MB';
const EMBD_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/bert-bge-small/ggml-model-f16.gguf';
const EMBD_MODEL_SIZE = '67MB';

async function main() {
  setCmplDisable(true);
  setEmbdDisable(true);
  const getName = (url) => url.match(/\/resolve\/main(.*)/)[1];
  elemCmplModel.textContent = `${getName(CMPL_MODEL)}, size: ${CMPL_MODEL_SIZE}`;
  elemEmbdModel.textContent = `${getName(EMBD_MODEL)}, size: ${EMBD_MODEL_SIZE}`;

  elemBtnStartCmpl.onclick = async () => {
    elemBtnStartCmpl.disabled = true;
    elemBtnPickFile.disabled = true;
    await startCompletions(CMPL_MODEL);
  };

  elemBtnPickFile.onchange = async (event) => {
    const { files } = event.target;
    if (files.length === 0) return;
    elemBtnStartCmpl.disabled = true;
    elemBtnPickFile.disabled = true;
    await startCompletions(null, files);
  };

  elemBtnStartEmbd.onclick = async () => {
    elemBtnStartEmbd.disabled = true;
    await startEmbeddings(EMBD_MODEL);
  };
}

// DOM elements: completions
const elemCmplModel = document.getElementById('cmpl_model');
const elemBtnStartCmpl = document.getElementById('btn_start_cmpl');
const elemBtnPickFile = document.getElementById('btn_pick_file');
const elemInput = document.getElementById('input_prompt');
const elemNPredict = document.getElementById('input_n_predict');
const elemBtnCompletions = document.getElementById('btn_run_cmpl');
const elemOutputCmpl = document.getElementById('output_cmpl');

// DOM elements: embeddings
const elemEmbdModel = document.getElementById('embd_model');
const elemBtnStartEmbd = document.getElementById('btn_start_embd');
const elemInputA = document.getElementById('input_a');
const elemInputB = document.getElementById('input_b');
const elemBtnEmbeddings = document.getElementById('btn_run_embd');
const elemOutputEmbd = document.getElementById('output_embd');

// utils
const setCmplDisable = (disabled) => {
  elemInput.disabled = disabled;
  elemNPredict.disabled = disabled;
  elemBtnCompletions.disabled = disabled;
};
const setEmbdDisable = (disabled) => {
  elemInputA.disabled = disabled;
  elemInputB.disabled = disabled;
  elemBtnEmbeddings.disabled = disabled;
};

main();
```

--------------------------------

### Alternative Wllama Initialization using CDN

Source: https://github.com/ngxson/wllama/blob/master/README.md

Shows an alternative method for initializing Wllama by importing WebAssembly modules directly from a CDN. This approach is useful when embedding WASM files in your project is not feasible, though it's generally recommended to bundle them.

```javascript
import WasmFromCDN from '@wllama/wllama/esm/wasm-from-cdn.js';
const wllama = new Wllama(WasmFromCDN);
// NOTE: this is not recommended, only use when you can't embed wasm files in your project
```

--------------------------------

### Wllama Initialization and Version Check

Source: https://context7.com/ngxson/wllama/llms.txt

Initialize a Wllama instance by providing paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build. You can also check the libllama version and the model loading status.

```APIDOC
## Initialize Wllama Instance

### Description
Create a new Wllama instance with configuration paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build. This section also shows how to check the libllama version and the model loading status.

### Method
`new Wllama(configPaths, options)`

### Request Example
```javascript
import { Wllama } from '@wllama/wllama';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS, {
  suppressNativeLog: false,
  parallelDownloads: 5,
  allowOffline: false,
  logger: console, // or custom logger
});

// Check version
const version = Wllama.getLibllamaVersion();
console.log('libllama version:', version);

// Check if model is loaded
if (wllama.isModelLoaded()) {
  console.log('Multi-thread enabled:', wllama.isMultithread());
  console.log('Number of threads:', wllama.getNumThreads());
}
```

### Response
#### Success Response (Initialization)
- `wllama` (Wllama instance) - The initialized Wllama instance.

#### Success Response (Version Check)
- `version` (string) - The version of libllama.

#### Success Response (Model Loaded Check)
- `isModelLoaded()` (boolean) - Returns true if a model is loaded.
- `isMultithread()` (boolean) - Returns true if multi-threading is enabled.
- `getNumThreads()` (number) - Returns the number of threads used.

#### Response Example (Initialization)
```json
// No direct JSON response, instance is returned.
```

#### Response Example (Version Check)
```json
{
  "libllama version": "1.5.1"
}
```

#### Response Example (Model Loaded Check)
```json
{
  "isModelLoaded": true,
  "isMultithread": true,
  "numThreads": 8
}
```
```

--------------------------------

### Initialize Wllama Instance with Configuration

Source: https://context7.com/ngxson/wllama/llms.txt

Creates a new Wllama instance by providing configuration paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build (single-thread or multi-thread). It also allows for checking the libllama version and model loading status.

```javascript
import { Wllama } from '@wllama/wllama';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS, {
  suppressNativeLog: false,
  parallelDownloads: 5,
  allowOffline: false,
  logger: console, // or custom logger
});

// Check version
const version = Wllama.getLibllamaVersion();
console.log('libllama version:', version);

// Check if model is loaded
if (wllama.isModelLoaded()) {
  console.log('Multi-thread enabled:', wllama.isMultithread());
  console.log('Number of threads:', wllama.getNumThreads());
}
```

--------------------------------

### Basic Usage: Load Model and Create Completion (ES6 Module)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Demonstrates how to use wllama with an ES6 module. It includes importing the Wllama class, defining configuration paths, loading a model from Hugging Face with a progress callback, and creating a text completion with specified parameters like nPredict and sampling settings.

```javascript
import { Wllama } from './esm/index.js';

(async () => {
  const CONFIG_PATHS = {
    'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
    'multi-thread/wllama.wasm' : './esm/multi-thread/wllama.wasm',
  };
  // Automatically switch between single-thread and multi-thread version based on browser support
  // If you want to enforce single-thread, add { "n_threads": 1 } to LoadModelConfig
  const wllama = new Wllama(CONFIG_PATHS);
  // Define a function for tracking the model download progress
  const progressCallback =  ({ loaded, total }) => {
    // Calculate the progress as a percentage
    const progressPercentage = Math.round((loaded / total) * 100);
    // Log the progress in a user-friendly format
    console.log(`Downloading... ${progressPercentage}%`);
  };
  // Load GGUF from Hugging Face hub
  // (alternatively, you can use loadModelFromUrl if the model is not from HF hub)
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
    }
  );
  // NOTE: elemInput is assumed to be an <input> or <textarea> element defined elsewhere on the page
  const outputText = await wllama.createCompletion(elemInput.value, {
    nPredict: 50,
    sampling: {
      temp: 0.5,
      top_k: 40,
      top_p: 0.9,
    },
  });
  console.log(outputText);
})();
```
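The progress callback above divides by `total`, which yields `NaN` or `Infinity` if the total size is missing or zero. The sketch below is one defensive way to format the same `{ loaded, total }` object; `formatProgress` is a hypothetical helper, and whether `total` can actually be zero in practice is an assumption.

```javascript
// Hypothetical helper: formats a { loaded, total } progress object without
// dividing by a missing or zero total.
function formatProgress({ loaded, total }) {
  if (!Number.isFinite(total) || total <= 0) {
    // Total size unknown: report raw bytes instead of a percentage
    return `Downloading... ${loaded} bytes`;
  }
  const pct = Math.min(100, Math.round((loaded / total) * 100));
  return `Downloading... ${pct}%`;
}

console.log(formatProgress({ loaded: 50, total: 200 }));
console.log(formatProgress({ loaded: 1024, total: 0 }));
```

It could be used as `progressCallback: (p) => console.log(formatProgress(p))`.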

--------------------------------

### TypeScript: Load and Run Text Completions with Wllama

Source: https://github.com/ngxson/wllama/blob/master/examples/basic/index.html

This snippet demonstrates how to load a language model from a URL or local files and perform text completion using the Wllama library. It handles user input for prompts and prediction parameters, and displays the generated completion in real-time. Dependencies include the Wllama library and DOM elements for UI interaction.

```TypeScript
import { Wllama } from '../../esm/index.js';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '../../esm/multi-thread/wllama.wasm',
};

const CMPL_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf';
const CMPL_MODEL_SIZE = '19MB';

async function startCompletions(modelUrl, files) {
  const wllama = new Wllama(CONFIG_PATHS);
  // await wllama.cacheManager.clear();
  if (files) {
    await wllama.loadModel(files);
  } else {
    await wllama.loadModelFromUrl(modelUrl);
  }
  setCmplDisable(false);
  elemBtnCompletions.onclick = async () => {
    setCmplDisable(true);
    await wllama.createCompletion(elemInput.value, {
      nPredict: parseInt(elemNPredict.value, 10),
      sampling: {
        temp: 0.5,
        top_k: 40,
        top_p: 0.9,
      },
      onNewToken: (token, piece, currentText) => {
        elemOutputCmpl.textContent = currentText;
      },
    });
    setCmplDisable(false);
  };
}

// DOM elements: completions
const elemCmplModel = document.getElementById('cmpl_model');
const elemBtnStartCmpl = document.getElementById('btn_start_cmpl');
const elemBtnPickFile = document.getElementById('btn_pick_file');
const elemInput = document.getElementById('input_prompt');
const elemNPredict = document.getElementById('input_n_predict');
const elemBtnCompletions = document.getElementById('btn_run_cmpl');
const elemOutputCmpl = document.getElementById('output_cmpl');

// utils
const setCmplDisable = (disabled) => {
  elemInput.disabled = disabled;
  elemNPredict.disabled = disabled;
  elemBtnCompletions.disabled = disabled;
};
```
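The example reads `nPredict` straight from a text input, so an empty or non-numeric value becomes `NaN`. A minimal sketch of input validation follows; `parseNPredict` and its default `fallback` and `max` values are assumptions chosen for illustration, not values taken from wllama.

```javascript
// Hypothetical helper: turns raw input text into a usable token count.
// The fallback (50) and upper bound (1024) are arbitrary illustrative defaults.
function parseNPredict(raw, fallback = 50, max = 1024) {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n <= 0) return fallback;
  return Math.min(n, max);
}

console.log(parseNPredict('128'));
console.log(parseNPredict(''));
```

In the handler above, this could replace the direct `parseInt` call, for example `nPredict: parseNPredict(elemNPredict.value)`.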

--------------------------------

### Initialize Wllama with Custom Emoji Logger (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Illustrates how to initialize Wllama with a completely custom logger object. This allows for detailed control over log output, including adding custom prefixes like emojis to different log levels (debug, log, warn, error).

```javascript
const wllama = new Wllama(pathConfig, {
  logger: {
    debug: (...args) => console.debug('🔧', ...args),
    log: (...args) => console.log('ℹ️', ...args),
    warn: (...args) => console.warn('⚠️', ...args),
    error: (...args) => console.error('☠️', ...args),
  },
});
```
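Since the logger is just an object with `debug`, `log`, `warn`, and `error` methods, it can also be generated. The sketch below builds such an object with a prefix and a minimum level; `makeLogger`, `minLevel`, `prefix`, and `sink` are hypothetical names, and the only assumption about wllama is the four-method shape shown above.

```javascript
// Hypothetical factory: builds a logger object with the four methods used above.
const LEVELS = ['debug', 'log', 'warn', 'error'];

function makeLogger({ minLevel = 'log', prefix = '[wllama]', sink = console } = {}) {
  const threshold = LEVELS.indexOf(minLevel);
  if (threshold === -1) throw new Error(`Unknown level: ${minLevel}`);
  const logger = {};
  LEVELS.forEach((level, i) => {
    logger[level] = i >= threshold
      ? (...args) => sink[level](prefix, ...args)
      : () => {}; // below the threshold: drop the message
  });
  return logger;
}

const quietLogger = makeLogger({ minLevel: 'warn' });
quietLogger.debug('not printed');
quietLogger.warn('printed with prefix');
```

The result could be passed as the `logger` option in the constructor call above.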

--------------------------------

### Create Text Completion

Source: https://context7.com/ngxson/wllama/llms.txt

Generate text completions based on a given prompt. Supports both non-streaming and real-time streaming completions, with options for advanced sampling parameters and cancellation via an abort signal.

```APIDOC
## Create Text Completion

### Description
Generate text completions based on a prompt. This API supports both non-streaming, where the full completion is returned at once, and streaming, where tokens are received in real-time. It also allows for cancellation using an AbortController.

### Method
`wllama.createCompletion(prompt, options)`

### Parameters (options object)
- **nPredict** (number) - The maximum number of tokens to predict.
- **sampling** (object) - Sampling parameters for controlling generation. 
  - **temp** (number) - Temperature for sampling (e.g., 0.7).
  - **top_k** (number) - Top-K sampling parameter.
  - **top_p** (number) - Top-p (nucleus) sampling parameter.
  - **penalty_repeat** (number) - Penalty for repeating tokens.
  - **penalty_last_n** (number) - The number of tokens to consider for repeat penalty.
  - **mirostat** (number) - Mirostat algorithm mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
  - **mirostat_tau** (number) - Mirostat tau parameter.
  - **mirostat_eta** (number) - Mirostat eta parameter.
- **stopTokens** (array) - An array of token IDs to stop generation at.
- **useCache** (boolean) - Whether to use the KV cache for this completion.
- **stream** (boolean) - If true, returns a stream of tokens. If false, returns the full completion.
- **abortSignal** (AbortSignal) - An AbortSignal to cancel the generation process.

### Request Example
```javascript
const prompt = "Once upon a time";

// Non-streaming completion
const outputText = await wllama.createCompletion(prompt, {
  nPredict: 100,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
    penalty_repeat: 1.1,
    penalty_last_n: 64,
  },
  stopTokens: [await wllama.lookupToken('\n\n')],
  useCache: false,
  stream: false,
});
console.log('Generated text:', outputText);

// Streaming completion with abort
const abortController = new AbortController();
setTimeout(() => abortController.abort(), 5000); // Abort after 5 seconds

try {
  const stream = await wllama.createCompletion(prompt, {
    nPredict: 200,
    sampling: {
      temp: 0.8,
      top_k: 50,
      top_p: 0.95,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
    abortSignal: abortController.signal,
    stream: true,
  });

  for await (const chunk of stream) {
    console.log('Piece:', new TextDecoder().decode(chunk.piece));
    console.log('Token ID:', chunk.token);
    console.log('Current text:', chunk.currentText);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Generation aborted');
  }
}
```

### Response
#### Success Response (Non-streaming)
- `outputText` (string) - The completed text.

#### Success Response (Streaming)
- `stream` (AsyncIterableIterator) - An iterator yielding chunks of data.
  - Each `chunk` object contains:
    - `piece` (Uint8Array) - The raw bytes of the newly generated text piece (decode with `TextDecoder`).
    - `token` (number) - The token ID.
    - `currentText` (string) - The text generated so far.

#### Response Example (Non-streaming)
```json
{
  "outputText": "Once upon a time, in a land far, far away..."
}
```

#### Response Example (Streaming)
```json
// For each chunk received:
{
  "piece": "In",
  "token": 1234,
  "currentText": "Once upon a time, In"
}
{
  "piece": " a",
  "token": 567,
  "currentText": "Once upon a time, In a"
}
// ... and so on.
```

#### Error Response
- `AbortError` - If the `abortSignal` is triggered.
- Other errors related to model inference.
```

--------------------------------

### Wllama Constructor Configuration Comparison (v1.x vs v2.0)

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Compares the `Wllama` constructor's `CONFIG_PATHS` parameter between v1.x and v2.0. v2.0 simplifies the configuration by requiring only the `.wasm` files, whereas v1.x also required the `.js` files and the worker `.mjs` file. It also shows an alternative that loads the files from a CDN.

```javascript
// Previously in v1.x:
const CONFIG_PATHS = {
  'single-thread/wllama.js'       : '../../esm/single-thread/wllama.js',
  'single-thread/wllama.wasm'     : '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.js'        : '../../esm/multi-thread/wllama.js',
  'multi-thread/wllama.wasm'      : '../../esm/multi-thread/wllama.wasm',
  'multi-thread/wllama.worker.mjs': '../../esm/multi-thread/wllama.worker.mjs',
};
const wllama = new Wllama(CONFIG_PATHS);
```

```javascript
// From v2.0:
// You only need to specify 2 files
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm' : '../../esm/multi-thread/wllama.wasm',
};
const wllama = new Wllama(CONFIG_PATHS);
```

```javascript
// Alternatively, you can use the *.wasm files from CDN:
import WasmFromCDN from '@wllama/wllama/esm/wasm-from-cdn.js';
const wllama = new Wllama(WasmFromCDN);
// NOTE: this is not recommended
// only use this when you can't embed wasm files in your project
```
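Going by the two configurations above, the v2.0 object is the v1.x object with every non-`.wasm` entry dropped. A minimal sketch of that migration step, using a hypothetical `migrateConfigPaths` helper:

```javascript
// Hypothetical helper: keeps only the *.wasm entries of a v1.x-style config,
// matching the v2.0 shape shown above.
function migrateConfigPaths(v1Paths) {
  return Object.fromEntries(
    Object.entries(v1Paths).filter(([key]) => key.endsWith('.wasm'))
  );
}

const legacyPaths = {
  'single-thread/wllama.js': '../../esm/single-thread/wllama.js',
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.js': '../../esm/multi-thread/wllama.js',
  'multi-thread/wllama.wasm': '../../esm/multi-thread/wllama.wasm',
  'multi-thread/wllama.worker.mjs': '../../esm/multi-thread/wllama.worker.mjs',
};
console.log(migrateConfigPaths(legacyPaths));
```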

--------------------------------

### Generate Text Completion (Streaming and Non-Streaming)

Source: https://context7.com/ngxson/wllama/llms.txt

Generates text completions based on a given prompt. Supports both non-streaming (returning the full output) and streaming (receiving tokens as they are generated) modes. Customizable sampling parameters, stop tokens, and an abort signal for early termination are available. The streaming mode provides chunks of text, token IDs, and the current complete text.

```javascript
const prompt = "Once upon a time";

// Non-streaming completion
const outputText = await wllama.createCompletion(prompt, {
  nPredict: 100,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
    penalty_repeat: 1.1,
    penalty_last_n: 64,
  },
  stopTokens: [await wllama.lookupToken('\n\n')],
  useCache: false,
  stream: false,
});
console.log('Generated text:', outputText);

// Streaming completion with abort
const abortController = new AbortController();
setTimeout(() => abortController.abort(), 5000); // Abort after 5 seconds

try {
  const stream = await wllama.createCompletion(prompt, {
    nPredict: 200,
    sampling: {
      temp: 0.8,
      top_k: 50,
      top_p: 0.95,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
    abortSignal: abortController.signal,
    stream: true,
  });

  for await (const chunk of stream) {
    console.log('Piece:', new TextDecoder().decode(chunk.piece));
    console.log('Token ID:', chunk.token);
    console.log('Current text:', chunk.currentText);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Generation aborted');
  }
}
```

--------------------------------

### Low-Level Decoding and Sampling with wllama

Source: https://context7.com/ngxson/wllama/llms.txt

Enables manual control over the inference process by allowing fine-grained token generation and sampling. This includes initializing sampling parameters, tokenizing prompts, managing the KV cache, accepting tokens into the sampling context, and decoding tokens. It supports advanced features like logit biasing and checking for end-of-generation tokens.

```javascript
// Initialize sampling context
await wllama.samplingInit({
  temp: 0.8,
  top_k: 40,
  top_p: 0.9,
  penalty_repeat: 1.1,
  penalty_last_n: 64,
  penalty_freq: 0.0,
  penalty_present: 0.0,
  mirostat: 0,
  logit_bias: [
    { token: 123, bias: -1.0 }, // Reduce probability of token 123
    { token: 456, bias: 2.0 },  // Increase probability of token 456
  ],
});

// Tokenize prompt
const prompt = "The meaning of life is";
let tokens = await wllama.tokenize(prompt, true);

// Add BOS token if required
if (wllama.mustAddBosToken()) {
  tokens.unshift(wllama.getBOS());
}

// Clear KV cache
await wllama.kvClear();

// Accept tokens into sampling context
await wllama.samplingAccept(tokens);

// Decode tokens
const { nPast } = await wllama.decode(tokens, { skipLogits: false });
console.log('Tokens in cache:', nPast);

// Generate tokens one by one
let generatedText = '';
for (let i = 0; i < 50; i++) {
  // Sample next token
  const { token, piece } = await wllama.samplingSample();

  // Check if end of generation
  if (wllama.isTokenEOG(token)) {
    break;
  }

  // Convert token to text
  generatedText += new TextDecoder().decode(piece);
  console.log('Generated:', generatedText);

  // Accept token and decode
  await wllama.samplingAccept([token]);
  await wllama.decode([token], {});
}
```
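To give a feel for what `temp` and `logit_bias` do conceptually, the sketch below applies a bias and a temperature to a toy array of raw scores and converts them to probabilities with a softmax. This is a textbook illustration only: it is not how wllama or llama.cpp implement sampling internally, and `softmaxWithBias` is a hypothetical name.

```javascript
// Conceptual sketch only: NOT wllama's or llama.cpp's implementation.
// logits: array of raw scores indexed by token ID.
function softmaxWithBias(logits, { temp = 1.0, logitBias = [] } = {}) {
  if (!(temp > 0)) throw new Error('temp must be greater than 0');
  const adjusted = logits.slice();
  for (const { token, bias } of logitBias) {
    // Ignore biases for token IDs outside the toy vocabulary
    if (token >= 0 && token < adjusted.length) adjusted[token] += bias;
  }
  // Lower temperatures sharpen the distribution, higher ones flatten it
  const scaled = adjusted.map((x) => x / temp);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Toy vocabulary of 3 tokens with equal raw scores; token 2 gets a positive bias
const probs = softmaxWithBias([0, 0, 0], {
  temp: 0.8,
  logitBias: [{ token: 2, bias: 2.0 }],
});
console.log(probs.map((p) => p.toFixed(3)));
```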

--------------------------------

### Load Split Model from URL with Parallel Downloads

Source: https://context7.com/ngxson/wllama/llms.txt

Loads a large model split into multiple files from a given URL. It utilizes parallel downloads to speed up the loading process and supports custom headers, caching, and context size configuration. A progress callback is provided to monitor download status.

```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5,
});

await wllama.loadModelFromUrl(
  'https://example.com/model-00001-of-00003.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
    useCache: true,
    n_ctx: 4096,
    headers: {
      'Authorization': 'Bearer your-token-here',
    },
  }
);

// Model chunks are automatically detected and loaded in parallel
```

--------------------------------

### Load Split Model in Wllama (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Demonstrates how to load a split GGUF model using Wllama's `loadModelFromHF` function. Given the path of the first file, Wllama automatically downloads the remaining chunks and loads them together. The `parallelDownloads` option sets how many files are downloaded at once.

```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5, // optional: maximum files to download in parallel (default: 3)
});
await wllama.loadModelFromHF(
  'ngxson/tinyllama_split_test',
  'stories15M-q8_0-00001-of-00003.gguf'
);
```
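The file names in these examples follow a `-00001-of-00003.gguf` pattern. As an illustration of how the remaining chunk names can be derived from the first one, here is a small sketch. Wllama performs this detection itself, so the helper is only explanatory; `expandShardNames` is a hypothetical name, and the five-digit pattern is inferred from the example file names.

```javascript
// Hypothetical helper: lists all shard names implied by a file name like
// "name-00001-of-00003.gguf". Returns the input unchanged if it is not split.
function expandShardNames(firstName) {
  const match = firstName.match(/^(.*)-(\d{5})-of-(\d{5})\.gguf$/);
  if (!match) return [firstName];
  const [, base, , totalStr] = match;
  const total = Number.parseInt(totalStr, 10);
  const names = [];
  for (let i = 1; i <= total; i++) {
    const index = String(i).padStart(5, '0');
    names.push(`${base}-${index}-of-${totalStr}.gguf`);
  }
  return names;
}

console.log(expandShardNames('stories15M-q8_0-00001-of-00003.gguf'));
```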

--------------------------------

### Load Split Model from URL

Source: https://context7.com/ngxson/wllama/llms.txt

Load large models that have been split into multiple files from a given URL. The library supports parallel downloads for efficiency and can utilize browser caching.

```APIDOC
## Load Split Model from URL

### Description
Load large models split into multiple chunks from a URL. This method supports parallel downloads for faster loading and can use browser caching to store model data efficiently.

### Method
`wllama.loadModelFromUrl(url, options)`

### Parameters (options object)
- **progressCallback** (function) - Optional - Callback function to track download progress. Receives an object `{ loaded: number, total: number }`.
- **useCache** (boolean) - Optional - Whether to use the browser's cache (OPFS) for storing model parts. Defaults to true if OPFS is available.
- **n_ctx** (number) - Optional - The context size for the model.
- **headers** (object) - Optional - An object containing custom headers to send with the download request (e.g., for authentication).

### Request Example
```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5,
});

await wllama.loadModelFromUrl(
  'https://example.com/model-00001-of-00003.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
    useCache: true,
    n_ctx: 4096,
    headers: {
      'Authorization': 'Bearer your-token-here',
    },
  }
);

// Model chunks are automatically detected and loaded in parallel
```

### Response
#### Success Response
- Model loading is asynchronous. Upon completion, the model is ready for use.

#### Response Example
```json
// No direct JSON response upon success, the model becomes available for use.
```

#### Error Response
- Errors during download or processing will be thrown.
```

--------------------------------

### Handle Wllama Errors and Abort Operations

Source: https://context7.com/ngxson/wllama/llms.txt

Demonstrates how to catch and handle various Wllama-specific errors, such as model loading failures or inference issues, using `WllamaError`. It also shows how to gracefully abort long-running operations using the `AbortController` and `WllamaAbortError`. Requires the Wllama library.

```javascript
import { Wllama, WllamaError, WllamaAbortError } from '@wllama/wllama';

const wllama = new Wllama(CONFIG_PATHS);

try {
  await wllama.loadModelFromHF('invalid-id', 'model.gguf');
} catch (error) {
  if (error instanceof WllamaError) {
    console.error('Wllama error type:', error.type);
    console.error('Error message:', error.message);

    switch (error.type) {
      case 'model_not_loaded':
        console.log('Model not loaded yet');
        break;
      case 'download_error':
        console.log('Failed to download model');
        break;
      case 'load_error':
        console.log('Failed to load model');
        break;
      case 'kv_cache_full':
        console.log('Context cache is full');
        break;
      case 'inference_error':
        console.log('Inference error occurred');
        break;
      default:
        console.log('Unknown error');
    }
  }
}

// Abort signal usage
const abortController = new AbortController();

// Abort after 3 seconds
setTimeout(() => abortController.abort(), 3000);

try {
  await wllama.createCompletion("Generate long text", {
    nPredict: 1000,
    abortSignal: abortController.signal,
  });
} catch (error) {
  if (error instanceof WllamaAbortError || error.name === 'AbortError') {
    console.log('Generation was aborted by user');
  }
}

// Cleanup
await wllama.exit();
```
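
The `switch` above can be folded into a small lookup table when the goal is only to show a message to the user. The following is a minimal sketch, not part of the library: `describeError` and `ERROR_MESSAGES` are hypothetical names, and the `type` strings are copied from the example above, so whether a given wllama version emits exactly these values is an assumption to verify.

```javascript
// Hypothetical lookup table. The keys mirror the `type` values used in the
// switch statement above; they are not checked against the library here.
const ERROR_MESSAGES = {
  model_not_loaded: 'Load a model before running inference.',
  download_error: 'The model could not be downloaded. Check the network and the URL.',
  load_error: 'The model file could not be loaded.',
  kv_cache_full: 'The context is full. Clear or trim the KV cache and retry.',
  inference_error: 'Inference failed.',
};

// Returns a short user-facing message for any thrown value.
function describeError(error) {
  // Aborts are detected by name, so plain AbortError objects are covered too.
  if (error && error.name === 'AbortError') {
    return 'Operation was cancelled.';
  }
  const type = error && typeof error.type === 'string' ? error.type : undefined;
  if (type && Object.prototype.hasOwnProperty.call(ERROR_MESSAGES, type)) {
    return ERROR_MESSAGES[type];
  }
  // Fall back to the raw message for anything the table does not know.
  return 'Unexpected error: ' + (error && error.message ? error.message : String(error));
}
```

Inside a `catch (error)` block, `console.error(describeError(error))` would then replace the whole `switch`.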

--------------------------------

### KV Cache Management with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This code illustrates how to manage the key-value (KV) cache in Wllama for efficient handling of conversation context. It covers clearing the cache, processing prompts, removing specific tokens from the cache, and retrieving information about the current context.

```javascript
// Clear entire KV cache
await wllama.kvClear();

// Process initial prompt
const prompt1 = "Write a short story about a robot:";
let tokens1 = await wllama.tokenize(prompt1, true);
await wllama.samplingInit({ temp: 0.7 });
await wllama.samplingAccept(tokens1);
await wllama.decode(tokens1, {});

// Generate some tokens...
// (generation code here)

// Remove tokens from cache (keep first 10, remove next 20)
await wllama.kvRemove(10, 20);

// Remove everything after position 15
await wllama.kvRemove(15, -1);

// Get information about the loaded context (settings and metadata, not cache occupancy)
const ctxInfo = wllama.getLoadedContextInfo();
console.log('Context size:', ctxInfo.n_ctx);
console.log('Batch size:', ctxInfo.n_batch);
console.log('Vocab size:', ctxInfo.n_vocab);
console.log('Model metadata:', ctxInfo.metadata);
```
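
One common use of `kvRemove` is a "context shift": once the cache is nearly full, keep the start of the prompt and drop part of the older history. The sketch below only computes the two arguments; `planContextShift`, the `reserve` margin, and the "drop half" policy are illustrative assumptions, and the caller is assumed to track the number of cached tokens itself.

```javascript
// Hypothetical helper: decides how many cached tokens to discard.
// nCtx    - context size (for example ctxInfo.n_ctx)
// nCached - tokens currently in the cache, tracked by the caller
// nKeep   - tokens at the start that must survive (for example a system prompt)
// reserve - free slots wanted before the next decode (arbitrary default)
function planContextShift(nCtx, nCached, nKeep, reserve = 32) {
  if (nCached + reserve <= nCtx) {
    return null; // enough room left, nothing to remove
  }
  const removable = Math.max(0, nCached - nKeep);
  const nDiscard = Math.floor(removable / 2); // drop half of the removable tokens
  return nDiscard > 0 ? { nKeep, nDiscard } : null;
}

// Usage sketch (needs a loaded model, so it is left as a comment):
// const plan = planContextShift(ctxInfo.n_ctx, cachedTokenCount, 10);
// if (plan) await wllama.kvRemove(plan.nKeep, plan.nDiscard);
```

Dropping half of the removable tokens at once, rather than one token at a time, keeps the number of cache edits small during long generations.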

--------------------------------

### Loading Hugging Face Models with Wllama

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Uses the `loadModelFromHF` helper to load a model directly from Hugging Face Hub. The helper wraps `loadModelFromUrl`, so only the repository ID and the file path inside the repository are needed.

```javascript
await wllama.loadModelFromHF(
  'ggml-org/models',
  'tinyllamas/stories260K.gguf'
);
```
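
To make the relationship with `loadModelFromUrl` concrete, the sketch below builds the kind of Hugging Face download URL that appears elsewhere in this document. `hfResolveUrl` is a hypothetical helper, and the assumption that the library builds its URL in exactly this way (including the `main` revision) is not confirmed by the source.

```javascript
// Hypothetical helper: turns a repo ID and a file path into a direct
// Hugging Face download URL of the form /<repo>/resolve/<revision>/<file>.
function hfResolveUrl(repo, filePath, revision = 'main') {
  return `https://huggingface.co/${repo}/resolve/${revision}/${filePath}`;
}

const url = hfResolveUrl('ggml-org/models', 'tinyllamas/stories260K.gguf');
console.log(url);
// Usage sketch (needs a Wllama instance): await wllama.loadModelFromUrl(url);
```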

--------------------------------

### CMake Build Configuration for wllama

Source: https://github.com/ngxson/wllama/blob/master/CMakeLists.txt

This snippet defines the build process for the wllama executable using CMake. It sets the minimum required CMake version and the project name, adds the llama.cpp submodule, and forces pthread-based threading. It then lists the source files, adds the include directories, and links the executable against `ggml`, `llama`, and the thread library.

```cmake
cmake_minimum_required(VERSION 3.14)
project("wllama")
add_subdirectory(llama.cpp)

set(CMAKE_THREAD_LIBS_INIT "-lpthread")
set(CMAKE_HAVE_THREADS_LIBRARY 1)
set(CMAKE_USE_WIN32_THREADS_INIT 0)
set(CMAKE_USE_PTHREADS_INIT 1)
set(THREADS_PREFER_PTHREAD_FLAG ON)

set(WLLAMA_SRC cpp/wllama.cpp
    cpp/actions.hpp
    cpp/glue.hpp
    cpp/helpers/wlog.cpp
    cpp/helpers/wcommon.cpp
    cpp/helpers/wsampling.cpp
    llama.cpp/include/llama.h)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/cpp)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/cpp/helpers)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/llama.cpp/include)

add_executable(wllama ${WLLAMA_SRC})
target_link_libraries(wllama PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
```

--------------------------------

### Load Model from Hugging Face

Source: https://context7.com/ngxson/wllama/llms.txt

Load a GGUF model directly from Hugging Face Hub. This function reports download progress and automatically loads models that have been split into multiple smaller GGUF files.

```APIDOC
## Load Model from Hugging Face

### Description
Load a GGUF model from Hugging Face Hub with progress tracking and automatic split-file handling. This allows easy integration of models hosted on Hugging Face.

### Method
`wllama.loadModelFromHF(repo, model, options)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Parameters (options object)
- **progressCallback** (function) - Optional - A callback function to track download progress. Receives an object `{ loaded: number, total: number }`.
- **n_ctx** (number) - Optional - The context size for the model.
- **n_threads** (number) - Optional - The number of threads to use for inference.
- **seed** (number) - Optional - The random seed for generation.
- **embeddings** (boolean) - Optional - Whether to load the model for embeddings.
- **n_batch** (number) - Optional - The batch size for processing.
- **cache_type_k** (string) - Optional - The quantization type for the K cache (e.g., 'f16').
- **cache_type_v** (string) - Optional - The quantization type for the V cache (e.g., 'f16').

### Request Example
```javascript
const progressCallback = ({ loaded, total }) => {
  const percentage = Math.round((loaded / total) * 100);
  console.log(`Downloading... ${percentage}% (${loaded}/${total} bytes)`);
};

try {
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
      n_ctx: 2048,
      n_threads: 4,
      seed: 42,
      embeddings: false,
      n_batch: 512,
      cache_type_k: 'f16',
      cache_type_v: 'f16',
    }
  );
  console.log('Model loaded successfully');

  // Get model metadata
  const metadata = wllama.getModelMetadata();
  console.log('Vocab size:', metadata.hparams.nVocab);
  console.log('Context training size:', metadata.hparams.nCtxTrain);
  console.log('Embedding dimensions:', metadata.hparams.nEmbd);
} catch (error) {
  console.error('Failed to load model:', error);
}
```

### Response
#### Success Response
- `Model loaded successfully` (console log) - Indicates the model has been loaded.
- `metadata` (object) - An object containing model metadata, including `hparams` (hyperparameters like `nVocab`, `nCtxTrain`, `nEmbd`).

#### Response Example
```json
// Upon successful loading, console logs will appear.
// Model metadata example:
{
  "hparams": {
    "nVocab": 32000,
    "nCtxTrain": 2048,
    "nEmbd": 2560
  }
}
```

#### Error Response
- `Failed to load model:` (console error) - Logs the error if the model fails to load.
```

--------------------------------

### Load GGUF Model from Hugging Face with Progress

Source: https://context7.com/ngxson/wllama/llms.txt

Loads a GGUF model from Hugging Face Hub and reports download status through a progress callback. Models that were split into smaller GGUF files beforehand, to stay under ArrayBuffer size limits, are loaded automatically. The options object also sets load-time parameters such as context size, thread count, batch size, and seed.

```javascript
const progressCallback = ({ loaded, total }) => {
  const percentage = Math.round((loaded / total) * 100);
  console.log(`Downloading... ${percentage}% (${loaded}/${total} bytes)`);
};

try {
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
      n_ctx: 2048,
      n_threads: 4,
      seed: 42,
      embeddings: false,
      n_batch: 512,
      cache_type_k: 'f16',
      cache_type_v: 'f16',
    }
  );
  console.log('Model loaded successfully');

  // Get model metadata
  const metadata = wllama.getModelMetadata();
  console.log('Vocab size:', metadata.hparams.nVocab);
  console.log('Context training size:', metadata.hparams.nCtxTrain);
  console.log('Embedding dimensions:', metadata.hparams.nEmbd);
} catch (error) {
  console.error('Failed to load model:', error);
}
```
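
The progress callback above divides by `total`, which yields `NaN` or `Infinity` if the total size is ever reported as `0`. Whether that can happen in practice is an assumption (for example, a server that omits the content length). The hypothetical `formatProgress` helper below is a defensive sketch that also prints sizes in MiB.

```javascript
// Hypothetical formatter for the { loaded, total } object passed to
// progressCallback. Falls back to a byte count when total is unusable.
function formatProgress({ loaded, total }) {
  const mib = (bytes) => (bytes / (1024 * 1024)).toFixed(1);
  if (!Number.isFinite(total) || total <= 0) {
    return `Downloading... ${mib(loaded)} MiB`;
  }
  // Clamp to 100 in case loaded briefly exceeds total.
  const percentage = Math.min(100, Math.round((loaded / total) * 100));
  return `Downloading... ${percentage}% (${mib(loaded)}/${mib(total)} MiB)`;
}

// Usage sketch: pass it through the options object.
// progressCallback: (progress) => console.log(formatProgress(progress))
```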

--------------------------------

### JavaScript - Helper Functions: Print and Timing

Source: https://github.com/ngxson/wllama/blob/master/examples/embeddings/index.html

This JavaScript code defines helper functions for printing output to a designated HTML element and measuring execution time. The `print` function appends messages to an element with the ID 'output', allowing for bold text formatting. The `timeStart` and `timeEnd` functions provide a simple mechanism for timing code execution.

```javascript
const elemOutput = document.getElementById('output');
function print(message, bold) {
  const elem = document.createElement('div');
  if (bold) {
    const b = document.createElement('b');
    b.innerText = message;
    elem.appendChild(b);
  } else {
    elem.innerText = message;
  }
  elemOutput.appendChild(elem);
  // scroll to bottom
  setTimeout(() => window.scrollTo({
    top: document.documentElement.scrollHeight - window.innerHeight,
    left: 0,
    behavior: 'smooth',
  }), 10);
}
let __startTime = 0;
function timeStart() {
  __startTime = Date.now();
}
function timeEnd() {
  return Date.now() - __startTime;
}
```

--------------------------------

### Initialize Wllama with Suppressed Debug Logs (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Shows how to initialize the Wllama instance with a custom logger that suppresses debug messages. This is achieved by using the predefined `LoggerWithoutDebug` class provided by the Wllama library.

```javascript
import { Wllama, LoggerWithoutDebug } from '@wllama/wllama';

const wllama = new Wllama(pathConfig, {
  // LoggerWithoutDebug is predefined inside wllama
  logger: LoggerWithoutDebug,
});
```
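
When finer control than `LoggerWithoutDebug` is wanted, a custom logger object can be passed instead. The sketch below assumes the `logger` option accepts any object with console-style `debug`, `log`, `warn`, and `error` methods; `makeLogger` and `LOG_LEVELS` are hypothetical names, not part of the library.

```javascript
// Hypothetical factory: builds a logger with four console-style methods.
// Calls below `minLevel` become no-ops; the rest are forwarded to `sink`.
const LOG_LEVELS = ['debug', 'log', 'warn', 'error'];

function makeLogger(minLevel = 'log', sink = console) {
  const threshold = LOG_LEVELS.indexOf(minLevel);
  if (threshold === -1) {
    throw new Error(`Unknown level: ${minLevel}`);
  }
  const logger = {};
  LOG_LEVELS.forEach((level, index) => {
    logger[level] = index >= threshold
      ? (...args) => sink[level](...args) // forwarded
      : () => {};                          // suppressed
  });
  return logger;
}

// Usage sketch (assumes the logger shape described above):
// const wllama = new Wllama(pathConfig, { logger: makeLogger('warn') });
```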

--------------------------------

### Split GGUF Model using llama-gguf-split

Source: https://github.com/ngxson/wllama/blob/master/README.md

Splits a large GGUF model file into smaller chunks to stay under ArrayBuffer size limits and to allow faster downloads. The `--split-max-size` option sets the maximum size of each chunk. Output files are numbered sequentially, starting from the given output prefix.

```bash
# Split the model into chunks of 512 Megabytes
./llama-gguf-split --split-max-size 512M ./my_model.gguf ./my_model
```
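
After splitting, it can be handy to list the expected shard names, for example to upload or verify them. The sketch below assumes the conventional `<prefix>-00001-of-0000N.gguf` naming used by `llama-gguf-split`; `shardFileNames` is a hypothetical helper, and the pattern should be checked against the files the tool actually writes.

```javascript
// Hypothetical helper: lists the shard file names expected for a prefix.
// Assumes five-digit, one-based numbering: <prefix>-00001-of-00003.gguf
function shardFileNames(prefix, count) {
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError('count must be a positive integer');
  }
  const pad = (n) => String(n).padStart(5, '0');
  return Array.from(
    { length: count },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(count)}.gguf`
  );
}

console.log(shardFileNames('my_model', 3));
```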

--------------------------------

### Model Manager: List, Download, and Remove Models with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This snippet demonstrates how to use the Wllama ModelManager to interact with cached language models. It covers listing existing models, downloading new models with progress callbacks, validating model integrity, refreshing invalid models, and removing models or clearing the entire cache.

```javascript
import { Wllama } from '@wllama/wllama';

const wllama = new Wllama(CONFIG_PATHS);

// Get all cached models
const models = await wllama.modelManager.getModels();
for (const model of models) {
  console.log('URL:', model.url);
  console.log('Size:', model.size, 'bytes');
  console.log('Files:', model.files.length);
  console.log('Valid:', model.validate());
}

// Download a specific model
const model = await wllama.modelManager.downloadModel(
  'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
  }
);

// Validate model
const validationStatus = model.validate();
console.log('Model status:', validationStatus);

// Refresh invalid model
if (validationStatus !== 'valid') {
  await model.refresh();
}

// Remove model from cache
await model.remove();

// Clear all models
await wllama.modelManager.clear();
```
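
Before clearing the cache, a summary of what it holds can help decide what to remove. The sketch below works on the array returned by `getModels()` and relies only on the `url`, `size`, and `validate()` members used in the snippet above; `summarizeCache` is a hypothetical helper, and the `'valid'` status string is taken from that snippet.

```javascript
// Hypothetical helper: summarizes a list of cached model entries.
// Each entry is assumed to expose `url`, `size` (bytes), and `validate()`.
function summarizeCache(models) {
  const totalBytes = models.reduce((sum, m) => sum + m.size, 0);
  const invalid = models
    .filter((m) => m.validate() !== 'valid')
    .map((m) => m.url);
  return { count: models.length, totalBytes, invalid };
}

// Usage sketch (needs a Wllama instance):
// const summary = summarizeCache(await wllama.modelManager.getModels());
// console.log(`${summary.count} models, ${summary.totalBytes} bytes`);
```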