### Compile llama.cpp Binary for Wllama (Shell)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Provides shell commands to compile the llama.cpp binary required by Wllama from source. The process involves cloning the repository, optionally updating the llama.cpp submodule, installing npm dependencies, building the WebAssembly module, and finally building the ES module.

```shell
# Clone the repository with submodule
git clone --recurse-submodules https://github.com/ngxson/wllama.git
cd wllama

# Optionally, you can run this command to update llama.cpp to latest upstream version (bleeding-edge, use at your own risk!)
# git submodule update --remote --merge

# Install the required modules
npm i

# First, build llama.cpp into wasm
npm run build:wasm

# Then, build ES module
npm run build
```

--------------------------------

### Install wllama in React TypeScript Project

Source: https://github.com/ngxson/wllama/blob/master/README.md

Install the wllama package using npm in your React TypeScript project. This command adds the library to your project's dependencies so that you can import and use the Wllama module.

```bash
npm i @wllama/wllama
```

--------------------------------

### Get Logits for Custom Sampling with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This snippet demonstrates how to retrieve token probabilities (logits) from the Wllama model to implement custom sampling strategies. It involves tokenizing input, initializing sampling, accepting tokens, decoding, getting logits, and then selecting a token based on these logits.

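Always taking the most likely token is greedy selection. As a rough sketch of another strategy, the helper below draws a token at random in proportion to its probability. It assumes only the `{ token, p }` entry shape that `getLogits` returns in this example; `sampleFromLogits` and its injectable `rng` parameter are hypothetical names, not part of the Wllama API.

```javascript
// Hypothetical helper (not part of the Wllama API): choose one entry from a
// list of { token, p } candidates, with probability proportional to p.
function sampleFromLogits(candidates, rng = Math.random) {
  if (candidates.length === 0) {
    throw new Error('No candidates to sample from');
  }
  // A top-N list rarely sums to exactly 1, so scale the draw by the actual total.
  const total = candidates.reduce((sum, c) => sum + c.p, 0);
  let threshold = rng() * total;
  for (const candidate of candidates) {
    threshold -= candidate.p;
    if (threshold < 0) {
      return candidate.token;
    }
  }
  // Floating-point rounding can leave a tiny remainder; fall back to the last entry.
  return candidates[candidates.length - 1].token;
}

// Example with made-up probabilities and a fixed "random" value.
const demo = [
  { token: 10, p: 0.6 },
  { token: 20, p: 0.3 },
  { token: 30, p: 0.1 },
];
console.log(sampleFromLogits(demo, () => 0.5));
```

In this example's flow, the returned token would be passed to `samplingAccept` and `decode` in place of the highest-probability token.
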
```javascript
const tokens = await wllama.tokenize("The color of the sky is", true);
await wllama.kvClear();
await wllama.samplingInit({ temp: 1.0 });
await wllama.samplingAccept(tokens);
await wllama.decode(tokens, { skipLogits: false });

// Get top 10 most likely tokens
const logits = await wllama.getLogits(10);
console.log('Top 10 tokens:');
for (const { token, p } of logits) {
  const text = await wllama.detokenize([token], true);
  console.log(`Token ${token} (${text}): ${(p * 100).toFixed(2)}%`);
}

// Custom sampling based on logits
const selectedToken = logits[0].token; // Pick highest probability
await wllama.samplingAccept([selectedToken]);
await wllama.decode([selectedToken], {});
```

--------------------------------

### Import and Initialize Wllama in React TypeScript

Source: https://github.com/ngxson/wllama/blob/master/README.md

Import the Wllama class from the installed package and create a new instance, passing the configuration paths for the WebAssembly modules. The Wllama instance can then be used to load models and perform inference.

```typescript
import { Wllama } from '@wllama/wllama';

let wllamaInstance = new Wllama(WLLAMA_CONFIG_PATHS, ...);
// (the rest is the same as in the earlier example)
```

--------------------------------

### Format Chat Messages with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This example shows how to format chat messages according to a model's chat template using the Wllama API. It covers using the model's built-in template, applying a custom Jinja2 template, and retrieving the model's default chat template.

```javascript
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Hello! Can you help me?' },
  { role: 'assistant', content: 'Of course! What do you need?' },
  { role: 'user', content: 'Tell me about AI.' },
];

// Use model's built-in template
const formatted = await wllama.formatChat(messages, true);
console.log('Formatted chat:\n', formatted);

// Use custom template (Jinja2 format)
const customTemplate = `
{%- for message in messages %}
{{ '<|' + message.role + '|>\n' + message.content + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{ '<|assistant|> ' }}
{%- endif %}
`;
const customFormatted = await wllama.formatChat(
  messages,
  true,
  customTemplate
);
console.log('Custom formatted:\n', customFormatted);

// Get model's chat template
const template = wllama.getChatTemplate();
if (template) {
  console.log('Model chat template:', template);
}
```

--------------------------------

### Manage OPFS Cache Directly with CacheManager

Source: https://context7.com/ngxson/wllama/llms.txt

Provides direct control over the OPFS cache for advanced use cases and manual file management. It supports listing files, opening a file, getting its size, retrieving metadata, deleting one or several files, and clearing the entire cache. Requires the wllama library.

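Before deleting anything, it can help to see what the cache holds. The sketch below summarizes a list of cache entries, assuming only the `name` and `size` fields used in this example; `summarizeCache` is a hypothetical helper, not a CacheManager method.

```javascript
// Hypothetical helper: summarize cache entries of the shape { name, size }.
function summarizeCache(entries, largeThresholdBytes = 1_000_000_000) {
  // Total bytes across every cached file.
  const totalBytes = entries.reduce((sum, entry) => sum + entry.size, 0);
  // Names of files larger than the threshold (candidates for cleanup).
  const largeFiles = entries
    .filter((entry) => entry.size > largeThresholdBytes)
    .map((entry) => entry.name);
  return { count: entries.length, totalBytes, largeFiles };
}

// Example with made-up entries.
const summary = summarizeCache([
  { name: 'small.gguf', size: 19_000_000 },
  { name: 'large.gguf', size: 2_500_000_000 },
]);
console.log(summary);
```

With a real cache, the array returned by `cacheManager.list()` would presumably be passed in place of the made-up entries.
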
```javascript
const cacheManager = wllama.cacheManager;

// List all cached files
const entries = await cacheManager.list();
for (const entry of entries) {
  console.log('File:', entry.name);
  console.log('Size:', entry.size);
  console.log('ETag:', entry.metadata.etag);
  console.log('Original URL:', entry.metadata.originalURL);
  console.log('Original size:', entry.metadata.originalSize);
}

// Get file name from URL
const fileName = await cacheManager.getNameFromURL(
  'https://example.com/model.gguf'
);
console.log('Cache file name:', fileName);

// Open cached file
const blob = await cacheManager.open(fileName);
if (blob) {
  console.log('File size:', blob.size);
  const arrayBuffer = await blob.arrayBuffer();
  console.log('First 4 bytes:', new Uint8Array(arrayBuffer, 0, 4));
}

// Get file size
const size = await cacheManager.getSize(fileName);
console.log('Size in cache:', size);

// Get metadata
const metadata = await cacheManager.getMetadata(fileName);
if (metadata) {
  console.log('Metadata:', metadata);
}

// Delete specific file
await cacheManager.delete('https://example.com/model.gguf');

// Delete multiple files
await cacheManager.deleteMany((entry) => entry.size > 1000000000);

// Clear entire cache
await cacheManager.clear();
```

--------------------------------

### Wllama Constructor with Configuration Options

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Demonstrates initializing the `Wllama` constructor with an optional second parameter, `WllamaConfig`. This allows setting configuration options like `parallelDownloads` and `allowOffline` directly during instantiation.

```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5, // maximum concurrent downloads
  allowOffline: false, // whether to allow offline model loading
});
```

--------------------------------

### TypeScript: Initialize Wllama Application and UI

Source: https://github.com/ngxson/wllama/blob/master/examples/basic/index.html

This snippet sets up the main execution flow for the Wllama application. It initializes the UI elements, sets up event handlers for buttons, and defines utility functions to enable/disable UI controls based on the application state. It calls the `startCompletions` and `startEmbeddings` functions to set up the respective functionalities.

```TypeScript
import { Wllama } from '../../esm/index.js';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '../../esm/multi-thread/wllama.wasm',
};
const CMPL_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf';
const CMPL_MODEL_SIZE = '19MB';
const EMBD_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/bert-bge-small/ggml-model-f16.gguf';
const EMBD_MODEL_SIZE = '67MB';

async function main() {
  setCmplDisable(true);
  setEmbdDisable(true);
  const getName = (url) => url.match(/\/resolve\/main(.*)/)[1];
  elemCmplModel.textContent = `${getName(CMPL_MODEL)}, size: ${CMPL_MODEL_SIZE}`;
  elemEmbdModel.textContent = `${getName(EMBD_MODEL)}, size: ${EMBD_MODEL_SIZE}`;

  elemBtnStartCmpl.onclick = async () => {
    elemBtnStartCmpl.disabled = true;
    elemBtnPickFile.disabled = true;
    await startCompletions(CMPL_MODEL);
  };

  elemBtnPickFile.onchange = async (event) => {
    const { files } = event.target;
    if (files.length === 0) return;
    elemBtnStartCmpl.disabled = true;
    elemBtnPickFile.disabled = true;
    await startCompletions(null, files);
  };

  elemBtnStartEmbd.onclick = async () => {
    elemBtnStartEmbd.disabled = true;
    await startEmbeddings(EMBD_MODEL);
  };
}

// DOM elements: completions
const elemCmplModel = document.getElementById('cmpl_model');
const elemBtnStartCmpl = document.getElementById('btn_start_cmpl');
const elemBtnPickFile = document.getElementById('btn_pick_file');
const elemInput = document.getElementById('input_prompt');
const elemNPredict = document.getElementById('input_n_predict');
const elemBtnCompletions = document.getElementById('btn_run_cmpl');
const elemOutputCmpl = document.getElementById('output_cmpl');

// DOM elements: embeddings
const elemEmbdModel = document.getElementById('embd_model');
const elemBtnStartEmbd = document.getElementById('btn_start_embd');
const elemInputA = document.getElementById('input_a');
const elemInputB = document.getElementById('input_b');
const elemBtnEmbeddings = document.getElementById('btn_run_embd');
const elemOutputEmbd = document.getElementById('output_embd');

// utils
const setCmplDisable = (disabled) => {
  elemInput.disabled = disabled;
  elemNPredict.disabled = disabled;
  elemBtnCompletions.disabled = disabled;
};
const setEmbdDisable = (disabled) => {
  elemInputA.disabled = disabled;
  elemInputB.disabled = disabled;
  elemBtnEmbeddings.disabled = disabled;
};

main();
```

--------------------------------

### Alternative Wllama Initialization using CDN

Source: https://github.com/ngxson/wllama/blob/master/README.md

Shows an alternative method for initializing Wllama by importing WebAssembly modules directly from a CDN. This approach is useful when embedding WASM files in your project is not feasible, though it's generally recommended to bundle them.

```javascript
import WasmFromCDN from '@wllama/wllama/esm/wasm-from-cdn.js';

const wllama = new Wllama(WasmFromCDN);
// NOTE: this is not recommended, only use when you can't embed wasm files in your project
```

--------------------------------

### Wllama Initialization and Version Check

Source: https://context7.com/ngxson/wllama/llms.txt

Initialize a Wllama instance by providing paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build. You can also check the libllama version and the model loading status.

```APIDOC
## Initialize Wllama Instance

### Description
Create a new Wllama instance with configuration paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build. This section also shows how to check the libllama version and the model loading status.

### Method
`new Wllama(configPaths, options)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```javascript
import { Wllama } from '@wllama/wllama';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS, {
  suppressNativeLog: false,
  parallelDownloads: 5,
  allowOffline: false,
  logger: console, // or custom logger
});

// Check version
const version = Wllama.getLibllamaVersion();
console.log('libllama version:', version);

// Check if model is loaded
if (wllama.isModelLoaded()) {
  console.log('Multi-thread enabled:', wllama.isMultithread());
  console.log('Number of threads:', wllama.getNumThreads());
}
```

### Response
#### Success Response (Initialization)
- `wllama` (Wllama instance) - The initialized Wllama instance.

#### Success Response (Version Check)
- `version` (string) - The version of libllama.

#### Success Response (Model Loaded Check)
- `isModelLoaded()` (boolean) - Returns true if a model is loaded.
- `isMultithread()` (boolean) - Returns true if multi-threading is enabled.
- `getNumThreads()` (number) - Returns the number of threads used.

#### Response Example (Initialization)
```json
// No direct JSON response, instance is returned.
```

#### Response Example (Version Check)
```json
{
  "libllama version": "1.5.1"
}
```

#### Response Example (Model Loaded Check)
```json
{
  "isModelLoaded": true,
  "isMultithread": true,
  "numThreads": 8
}
```
```

--------------------------------

### Initialize Wllama Instance with Configuration

Source: https://context7.com/ngxson/wllama/llms.txt

Creates a new Wllama instance by providing configuration paths to WebAssembly files. The instance automatically detects browser capabilities and selects the appropriate build (single-thread or multi-thread). It also allows for checking the libllama version and model loading status.

```javascript
import { Wllama } from '@wllama/wllama';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS, {
  suppressNativeLog: false,
  parallelDownloads: 5,
  allowOffline: false,
  logger: console, // or custom logger
});

// Check version
const version = Wllama.getLibllamaVersion();
console.log('libllama version:', version);

// Check if model is loaded
if (wllama.isModelLoaded()) {
  console.log('Multi-thread enabled:', wllama.isMultithread());
  console.log('Number of threads:', wllama.getNumThreads());
}
```

--------------------------------

### Basic Usage: Load Model and Create Completion (ES6 Module)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Demonstrates how to use wllama with an ES6 module. It includes importing the Wllama class, defining configuration paths, loading a model from Hugging Face with a progress callback, and creating a text completion with specified parameters like nPredict and sampling settings.

```javascript
import { Wllama } from './esm/index.js';

(async () => {
  const CONFIG_PATHS = {
    'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
    'multi-thread/wllama.wasm' : './esm/multi-thread/wllama.wasm',
  };
  // Automatically switch between single-thread and multi-thread version based on browser support
  // If you want to enforce single-thread, add { "n_threads": 1 } to LoadModelConfig
  const wllama = new Wllama(CONFIG_PATHS);
  // Define a function for tracking the model download progress
  const progressCallback = ({ loaded, total }) => {
    // Calculate the progress as a percentage
    const progressPercentage = Math.round((loaded / total) * 100);
    // Log the progress in a user-friendly format
    console.log(`Downloading... ${progressPercentage}%`);
  };
  // Load GGUF from Hugging Face hub
  // (alternatively, you can use loadModelFromUrl if the model is not from HF hub)
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
    }
  );
  const outputText = await wllama.createCompletion(elemInput.value, {
    nPredict: 50,
    sampling: {
      temp: 0.5,
      top_k: 40,
      top_p: 0.9,
    },
  });
  console.log(outputText);
})();
```

--------------------------------

### TypeScript: Load and Run Text Completions with Wllama

Source: https://github.com/ngxson/wllama/blob/master/examples/basic/index.html

This snippet demonstrates how to load a language model from a URL or local files and perform text completion using the Wllama library. It handles user input for prompts and prediction parameters, and displays the generated completion in real-time. Dependencies include the Wllama library and DOM elements for UI interaction.

```TypeScript
import { Wllama } from '../../esm/index.js';

const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '../../esm/multi-thread/wllama.wasm',
};
const CMPL_MODEL = 'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf';
const CMPL_MODEL_SIZE = '19MB';

async function startCompletions(modelUrl, files) {
  const wllama = new Wllama(CONFIG_PATHS);
  // await wllama.cacheManager.clear();
  if (files) {
    await wllama.loadModel(files);
  } else {
    await wllama.loadModelFromUrl(modelUrl);
  }
  setCmplDisable(false);

  elemBtnCompletions.onclick = async () => {
    setCmplDisable(true);
    await wllama.createCompletion(elemInput.value, {
      nPredict: parseInt(elemNPredict.value),
      sampling: {
        temp: 0.5,
        top_k: 40,
        top_p: 0.9,
      },
      onNewToken: (token, piece, currentText) => {
        elemOutputCmpl.textContent = currentText;
      },
    });
    setCmplDisable(false);
  };
}

// DOM elements: completions
const elemCmplModel = document.getElementById('cmpl_model');
const elemBtnStartCmpl = document.getElementById('btn_start_cmpl');
const elemBtnPickFile = document.getElementById('btn_pick_file');
const elemInput = document.getElementById('input_prompt');
const elemNPredict = document.getElementById('input_n_predict');
const elemBtnCompletions = document.getElementById('btn_run_cmpl');
const elemOutputCmpl = document.getElementById('output_cmpl');

// utils
const setCmplDisable = (disabled) => {
  elemInput.disabled = disabled;
  elemNPredict.disabled = disabled;
  elemBtnCompletions.disabled = disabled;
};
```

--------------------------------

### Initialize Wllama with Custom Emoji Logger (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Illustrates how to initialize Wllama with a completely custom logger object. This allows for detailed control over log output, including adding custom prefixes like emojis to different log levels (debug, log, warn, error).

```javascript
const wllama = new Wllama(pathConfig, {
  logger: {
    debug: (...args) => console.debug('🔧', ...args),
    log: (...args) => console.log('ℹ️', ...args),
    warn: (...args) => console.warn('⚠️', ...args),
    error: (...args) => console.error('☠️', ...args),
  },
});
```

--------------------------------

### Create Text Completion

Source: https://context7.com/ngxson/wllama/llms.txt

Generate text completions based on a given prompt. Supports both non-streaming and real-time streaming completions, with options for advanced sampling parameters and cancellation.

```APIDOC
## Create Text Completion

### Description
Generate text completions based on a prompt. This API supports both non-streaming mode, where the full completion is returned at once, and streaming mode, where tokens are received in real-time. It also allows for cancellation using an AbortController.

### Method
`wllama.createCompletion(prompt, options)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Parameters (options object)
- **nPredict** (number) - The maximum number of tokens to predict.
- **sampling** (object) - Sampling parameters for controlling generation.
  - **temp** (number) - Temperature for sampling (e.g., 0.7).
  - **top_k** (number) - Top-K sampling parameter.
  - **top_p** (number) - Top-p (nucleus) sampling parameter.
  - **penalty_repeat** (number) - Penalty for repeating tokens.
  - **penalty_last_n** (number) - The number of tokens to consider for repeat penalty.
  - **mirostat** (number) - Mirostat algorithm mode (1 or 2; 0 disables it).
  - **mirostat_tau** (number) - Mirostat tau parameter.
  - **mirostat_eta** (number) - Mirostat eta parameter.
- **stopTokens** (array) - An array of token IDs to stop generation at.
- **useCache** (boolean) - Whether to use the KV cache for this completion.
- **stream** (boolean) - If true, returns a stream of tokens. If false, returns the full completion.
- **abortSignal** (AbortSignal) - An AbortSignal to cancel the generation process.

### Request Example
```javascript
const prompt = "Once upon a time";

// Non-streaming completion
const outputText = await wllama.createCompletion(prompt, {
  nPredict: 100,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
    penalty_repeat: 1.1,
    penalty_last_n: 64,
  },
  stopTokens: [await wllama.lookupToken('\n\n')],
  useCache: false,
  stream: false,
});
console.log('Generated text:', outputText);

// Streaming completion with abort
const abortController = new AbortController();
setTimeout(() => abortController.abort(), 5000); // Abort after 5 seconds

try {
  const stream = await wllama.createCompletion(prompt, {
    nPredict: 200,
    sampling: {
      temp: 0.8,
      top_k: 50,
      top_p: 0.95,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
    abortSignal: abortController.signal,
    stream: true,
  });

  for await (const chunk of stream) {
    // chunk.piece holds raw bytes; decode before printing
    // (console.log is used because process.stdout is not available in browsers)
    console.log('Piece:', new TextDecoder().decode(chunk.piece));
    console.log('Token ID:', chunk.token);
    console.log('Current text:', chunk.currentText);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Generation aborted');
  }
}
```

### Response
#### Success Response (Non-streaming)
- `outputText` (string) - The completed text.

#### Success Response (Streaming)
- `stream` (AsyncIterableIterator) - An iterator yielding chunks of data.
  - Each `chunk` object contains:
    - `piece` (Uint8Array) - The raw bytes of the newly generated piece (decode with `TextDecoder`).
    - `token` (number) - The token ID.
    - `currentText` (string) - The text generated so far.

#### Response Example (Non-streaming)
```json
{
  "outputText": "Once upon a time, in a land far, far away..."
}
```

#### Response Example (Streaming)
```json
// For each chunk received:
{
  "piece": "In",
  "token": 1234,
  "currentText": "Once upon a time, In"
}
{
  "piece": " a",
  "token": 567,
  "currentText": "Once upon a time, In a"
}
// ... and so on.
```

#### Error Response
- `AbortError` - If the `abortSignal` is triggered.
- Other errors related to model inference.
```

--------------------------------

### Wllama Constructor Configuration Comparison (v1.x vs v2.0)

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Compares the `Wllama` constructor's `CONFIG_PATHS` parameter between v1.x and v2.0. v2.0 simplifies the configuration by requiring only the `.wasm` files, whereas v1.x required `.js`, `.wasm`, and worker files. It also shows an alternative using a CDN.

```javascript
// Previously in v1.x:
const CONFIG_PATHS = {
  'single-thread/wllama.js'       : '../../esm/single-thread/wllama.js',
  'single-thread/wllama.wasm'     : '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.js'        : '../../esm/multi-thread/wllama.js',
  'multi-thread/wllama.wasm'      : '../../esm/multi-thread/wllama.wasm',
  'multi-thread/wllama.worker.mjs': '../../esm/multi-thread/wllama.worker.mjs',
};
const wllama = new Wllama(CONFIG_PATHS);
```

```javascript
// From v2.0:
// You only need to specify 2 files
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '../../esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm' : '../../esm/multi-thread/wllama.wasm',
};
const wllama = new Wllama(CONFIG_PATHS);
```

```javascript
// Alternatively, you can use the *.wasm files from CDN:
import WasmFromCDN from '@wllama/wllama/esm/wasm-from-cdn.js';
const wllama = new Wllama(WasmFromCDN);
// NOTE: this is not recommended
// only use this when you can't embed wasm files in your project
```

--------------------------------

### Generate Text Completion (Streaming and Non-Streaming)

Source: https://context7.com/ngxson/wllama/llms.txt

Generates text completions based on a given prompt. Supports both non-streaming (returning the full output) and streaming (receiving tokens as they are generated) modes. Customizable sampling parameters, stop tokens, and an abort signal for early termination are available. In streaming mode, each chunk provides the new piece, its token ID, and the full text generated so far.

```javascript
const prompt = "Once upon a time";

// Non-streaming completion
const outputText = await wllama.createCompletion(prompt, {
  nPredict: 100,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
    penalty_repeat: 1.1,
    penalty_last_n: 64,
  },
  stopTokens: [await wllama.lookupToken('\n\n')],
  useCache: false,
  stream: false,
});
console.log('Generated text:', outputText);

// Streaming completion with abort
const abortController = new AbortController();
setTimeout(() => abortController.abort(), 5000); // Abort after 5 seconds

try {
  const stream = await wllama.createCompletion(prompt, {
    nPredict: 200,
    sampling: {
      temp: 0.8,
      top_k: 50,
      top_p: 0.95,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
    abortSignal: abortController.signal,
    stream: true,
  });

  for await (const chunk of stream) {
    // chunk.piece holds raw bytes; decode before printing
    // (console.log is used because process.stdout is not available in browsers)
    console.log('Piece:', new TextDecoder().decode(chunk.piece));
    console.log('Token ID:', chunk.token);
    console.log('Current text:', chunk.currentText);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Generation aborted');
  }
}
```

--------------------------------

### Low-Level Decoding and Sampling with wllama

Source: https://context7.com/ngxson/wllama/llms.txt

Gives manual control over the inference process through fine-grained token generation and sampling. The steps are: initialize sampling parameters, tokenize the prompt, clear the KV cache, accept tokens into the sampling context, and decode tokens. It also covers logit biasing and checking for end-of-generation tokens.

```javascript
// Initialize sampling context
await wllama.samplingInit({
  temp: 0.8,
  top_k: 40,
  top_p: 0.9,
  penalty_repeat: 1.1,
  penalty_last_n: 64,
  penalty_freq: 0.0,
  penalty_present: 0.0,
  mirostat: 0,
  logit_bias: [
    { token: 123, bias: -1.0 }, // Reduce probability of token 123
    { token: 456, bias: 2.0 }, // Increase probability of token 456
  ],
});

// Tokenize prompt
const prompt = "The meaning of life is";
let tokens = await wllama.tokenize(prompt, true);

// Add BOS token if required
if (wllama.mustAddBosToken()) {
  tokens.unshift(wllama.getBOS());
}

// Clear KV cache
await wllama.kvClear();

// Accept tokens into sampling context
await wllama.samplingAccept(tokens);

// Decode tokens
const { nPast } = await wllama.decode(tokens, { skipLogits: false });
console.log('Tokens in cache:', nPast);

// Generate tokens one by one
let generatedText = '';
for (let i = 0; i < 50; i++) {
  // Sample next token
  const { token, piece } = await wllama.samplingSample();

  // Check if end of generation
  if (wllama.isTokenEOG(token)) {
    break;
  }

  // Convert token to text
  generatedText += new TextDecoder().decode(piece);
  console.log('Generated:', generatedText);

  // Accept token and decode
  await wllama.samplingAccept([token]);
  await wllama.decode([token], {});
}
```

--------------------------------

### Load Split Model from URL with Parallel Downloads

Source: https://context7.com/ngxson/wllama/llms.txt

Loads a large model split into multiple files from a given URL. It utilizes parallel downloads to speed up the loading process and supports custom headers, caching, and context size configuration. A progress callback is provided to monitor download status.

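Split GGUF files conventionally carry a `-00001-of-00003` style suffix. To illustrate what deriving the remaining files from the first URL might involve, the sketch below expands that suffix into the full list of shard URLs. It is an assumption about the naming pattern only, not Wllama's actual implementation, and `listShardUrls` is a hypothetical helper.

```javascript
// Hypothetical helper: expand the URL of the first shard into all shard URLs,
// assuming a "-<index>-of-<count>.gguf" suffix with zero-padded numbers.
function listShardUrls(firstUrl) {
  const match = firstUrl.match(/-(\d+)-of-(\d+)\.gguf$/);
  if (!match) {
    return [firstUrl]; // No split suffix: treat it as a single-file model.
  }
  const [suffix, index, count] = match;
  const prefix = firstUrl.slice(0, firstUrl.length - suffix.length);
  const width = index.length; // Keep the same zero-padding as the input.
  const urls = [];
  for (let i = 1; i <= parseInt(count, 10); i++) {
    const padded = String(i).padStart(width, '0');
    urls.push(`${prefix}-${padded}-of-${count}.gguf`);
  }
  return urls;
}

console.log(listShardUrls('https://example.com/model-00001-of-00003.gguf'));
```

With Wllama itself none of this is needed: only the URL of the first file is passed to `loadModelFromUrl`.
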
```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5,
});

await wllama.loadModelFromUrl(
  'https://example.com/model-00001-of-00003.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
    useCache: true,
    n_ctx: 4096,
    headers: {
      'Authorization': 'Bearer your-token-here',
    },
  }
);
// Model chunks are automatically detected and loaded in parallel
```

--------------------------------

### Load Split Model in Wllama (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Demonstrates how to load a split GGUF model using Wllama's `loadModelFromHF` function. Wllama automatically handles downloading and assembling the model chunks when provided with the URL of the first file. The `parallelDownloads` option can optimize download speed.

```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5, // optional: maximum files to download in parallel (default: 3)
});
await wllama.loadModelFromHF(
  'ngxson/tinyllama_split_test',
  'stories15M-q8_0-00001-of-00003.gguf'
);
```

--------------------------------

### Load Split Model from URL

Source: https://context7.com/ngxson/wllama/llms.txt

Load large models that have been split into multiple files from a given URL. The library supports parallel downloads for efficiency and can utilize browser caching.

```APIDOC
## Load Split Model from URL

### Description
Load large models split into multiple chunks from a URL. This method supports parallel downloads for faster loading and can use browser caching to store model data efficiently.

### Method
`wllama.loadModelFromUrl(url, options)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Parameters (options object)
- **progressCallback** (function) - Optional - Callback function to track download progress. Receives an object `{ loaded: number, total: number }`.
- **useCache** (boolean) - Optional - Whether to use the browser's cache (OPFS) for storing model parts. Defaults to true if OPFS is available.
- **n_ctx** (number) - Optional - The context size for the model.
- **headers** (object) - Optional - An object containing custom headers to send with the download request (e.g., for authentication).

### Request Example
```javascript
const wllama = new Wllama(CONFIG_PATHS, {
  parallelDownloads: 5,
});

await wllama.loadModelFromUrl(
  'https://example.com/model-00001-of-00003.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
    useCache: true,
    n_ctx: 4096,
    headers: {
      'Authorization': 'Bearer your-token-here',
    },
  }
);
// Model chunks are automatically detected and loaded in parallel
```

### Response
#### Success Response (200)
- Model loading is asynchronous. Upon completion, the model is ready for use.

#### Response Example
```json
// No direct JSON response upon success, the model becomes available for use.
```

#### Error Response
- Errors during download or processing will be thrown.
```

--------------------------------

### Handle Wllama Errors and Abort Operations

Source: https://context7.com/ngxson/wllama/llms.txt

Demonstrates how to catch and handle various Wllama-specific errors, such as model loading failures or inference issues, using `WllamaError`. It also shows how to gracefully abort long-running operations using the `AbortController` and `WllamaAbortError`. Requires the Wllama library.

```javascript
import { Wllama, WllamaError, WllamaAbortError } from '@wllama/wllama';

const wllama = new Wllama(CONFIG_PATHS);

try {
  await wllama.loadModelFromHF('invalid-id', 'model.gguf');
} catch (error) {
  if (error instanceof WllamaError) {
    console.error('Wllama error type:', error.type);
    console.error('Error message:', error.message);

    switch (error.type) {
      case 'model_not_loaded':
        console.log('Model not loaded yet');
        break;
      case 'download_error':
        console.log('Failed to download model');
        break;
      case 'load_error':
        console.log('Failed to load model');
        break;
      case 'kv_cache_full':
        console.log('Context cache is full');
        break;
      case 'inference_error':
        console.log('Inference error occurred');
        break;
      default:
        console.log('Unknown error');
    }
  }
}

// Abort signal usage
const abortController = new AbortController();

// Abort after 3 seconds
setTimeout(() => abortController.abort(), 3000);

try {
  await wllama.createCompletion("Generate long text", {
    nPredict: 1000,
    abortSignal: abortController.signal,
  });
} catch (error) {
  if (error instanceof WllamaAbortError || error.name === 'AbortError') {
    console.log('Generation was aborted by user');
  }
}

// Cleanup
await wllama.exit();
```

--------------------------------

### KV Cache Management with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This code illustrates how to manage the key-value (KV) cache in Wllama for efficient handling of conversation context. It covers clearing the cache, processing prompts, removing specific tokens from the cache, and retrieving information about the current context.

```javascript
// Clear entire KV cache
await wllama.kvClear();

// Process initial prompt
const prompt1 = "Write a short story about a robot:";
let tokens1 = await wllama.tokenize(prompt1, true);
await wllama.samplingInit({ temp: 0.7 });
await wllama.samplingAccept(tokens1);
await wllama.decode(tokens1, {});

// Generate some tokens...
// (generation code here)

// Remove tokens from cache (keep first 10, remove next 20)
await wllama.kvRemove(10, 20);

// Remove everything after position 15
await wllama.kvRemove(15, -1);

// Get cache status
const ctxInfo = wllama.getLoadedContextInfo();
console.log('Context size:', ctxInfo.n_ctx);
console.log('Batch size:', ctxInfo.n_batch);
console.log('Vocab size:', ctxInfo.n_vocab);
console.log('Model metadata:', ctxInfo.metadata);
```

--------------------------------

### Loading Hugging Face Models with Wllama

Source: https://github.com/ngxson/wllama/blob/master/guides/intro-v2.md

Illustrates the usage of the `loadModelFromHF` helper function to conveniently load models directly from Hugging Face Hub. This function simplifies the process by wrapping `loadModelFromUrl` for HF repository URLs.

```javascript
await wllama.loadModelFromHF(
  'ggml-org/models',
  'tinyllamas/stories260K.gguf'
);
```

--------------------------------

### CMake Build Configuration for wllama

Source: https://github.com/ngxson/wllama/blob/master/CMakeLists.txt

This snippet defines the build process for the wllama executable using CMake. It sets the minimum required CMake version, project name, includes the llama.cpp submodule, and configures threading libraries. It then specifies source files, include directories, and links against necessary libraries.

```cmake
cmake_minimum_required(VERSION 3.14)
project("wllama")
add_subdirectory(llama.cpp)

set(CMAKE_THREAD_LIBS_INIT "-lpthread")
set(CMAKE_HAVE_THREADS_LIBRARY 1)
set(CMAKE_USE_WIN32_THREADS_INIT 0)
set(CMAKE_USE_PTHREADS_INIT 1)
set(THREADS_PREFER_PTHREAD_FLAG ON)

set(WLLAMA_SRC cpp/wllama.cpp
  cpp/actions.hpp
  cpp/glue.hpp
  cpp/helpers/wlog.cpp
  cpp/helpers/wcommon.cpp
  cpp/helpers/wsampling.cpp
  llama.cpp/include/llama.h)

include_directories(${CMAKE_CURRENT_SOURCE_DIR}/cpp)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/cpp/helpers)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/llama.cpp/include)

add_executable(wllama ${WLLAMA_SRC})
target_link_libraries(wllama PRIVATE ggml llama ${CMAKE_THREAD_LIBS_INIT})
```

--------------------------------

### Load Model from Hugging Face

Source: https://context7.com/ngxson/wllama/llms.txt

Load a GGUF model directly from Hugging Face Hub. This function supports download progress tracking and automatically loads the remaining files of a model that has been split into smaller chunks.

```APIDOC
## Load Model from Hugging Face

### Description
Load a GGUF model from Hugging Face Hub with progress tracking and automatic split-file handling. This allows easy integration of models hosted on Hugging Face.

### Method
`wllama.loadModelFromHF(repo, model, options)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Parameters (options object)
- **progressCallback** (function) - Optional - A callback function to track download progress. Receives an object `{ loaded: number, total: number }`.
- **n_ctx** (number) - Optional - The context size for the model.
- **n_threads** (number) - Optional - The number of threads to use for inference.
- **seed** (number) - Optional - The random seed for generation.
- **embeddings** (boolean) - Optional - Whether to load the model for embeddings.
- **n_batch** (number) - Optional - The batch size for processing.
- **cache_type_k** (string) - Optional - The quantization type for the K cache (e.g., 'f16').
- **cache_type_v** (string) - Optional - The quantization type for the V cache (e.g., 'f16').

### Request Example
```javascript
const progressCallback = ({ loaded, total }) => {
  const percentage = Math.round((loaded / total) * 100);
  console.log(`Downloading... ${percentage}% (${loaded}/${total} bytes)`);
};

try {
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
      n_ctx: 2048,
      n_threads: 4,
      seed: 42,
      embeddings: false,
      n_batch: 512,
      cache_type_k: 'f16',
      cache_type_v: 'f16',
    }
  );
  console.log('Model loaded successfully');

  // Get model metadata
  const metadata = wllama.getModelMetadata();
  console.log('Vocab size:', metadata.hparams.nVocab);
  console.log('Context training size:', metadata.hparams.nCtxTrain);
  console.log('Embedding dimensions:', metadata.hparams.nEmbd);
} catch (error) {
  console.error('Failed to load model:', error);
}
```

### Response
#### Success Response (200)
- `Model loaded successfully` (console log) - Indicates the model has been loaded.
- `metadata` (object) - An object containing model metadata, including `hparams` (hyperparameters like `nVocab`, `nCtxTrain`, `nEmbd`).

#### Response Example
```json
// Upon successful loading, console logs will appear.
// Model metadata example:
{
  "hparams": {
    "nVocab": 32000,
    "nCtxTrain": 2048,
    "nEmbd": 2560
  }
}
```

#### Error Response
- `Failed to load model:` (console error) - Logs the error if the model fails to load.
```

--------------------------------

### Load GGUF Model from Hugging Face with Progress

Source: https://context7.com/ngxson/wllama/llms.txt

Loads a GGUF model from Hugging Face Hub, providing a progress callback to track download status. Models that were split into smaller chunks to stay within ArrayBuffer size limits are loaded automatically.
It also accepts model parameters such as context size, thread count, and random seed.

```javascript
const progressCallback = ({ loaded, total }) => {
  const percentage = Math.round((loaded / total) * 100);
  console.log(`Downloading... ${percentage}% (${loaded}/${total} bytes)`);
};

try {
  await wllama.loadModelFromHF(
    'ggml-org/models',
    'tinyllamas/stories260K.gguf',
    {
      progressCallback,
      n_ctx: 2048,
      n_threads: 4,
      seed: 42,
      embeddings: false,
      n_batch: 512,
      cache_type_k: 'f16',
      cache_type_v: 'f16',
    }
  );
  console.log('Model loaded successfully');

  // Get model metadata
  const metadata = wllama.getModelMetadata();
  console.log('Vocab size:', metadata.hparams.nVocab);
  console.log('Context training size:', metadata.hparams.nCtxTrain);
  console.log('Embedding dimensions:', metadata.hparams.nEmbd);
} catch (error) {
  console.error('Failed to load model:', error);
}
```

--------------------------------

### JavaScript - Helper Functions: Print and Timing

Source: https://github.com/ngxson/wllama/blob/master/examples/embeddings/index.html

This JavaScript code defines helper functions for printing output to a designated HTML element and measuring execution time. The `print` function appends messages to an element with the ID 'output', allowing for bold text formatting. The `timeStart` and `timeEnd` functions provide a simple mechanism for timing code execution.
```JavaScript
const elemOutput = document.getElementById('output');

function print(message, bold) {
  const elem = document.createElement('div');
  if (bold) {
    const b = document.createElement('b');
    b.innerText = message;
    elem.appendChild(b);
  } else {
    elem.innerText = message;
  }
  elemOutput.appendChild(elem);
  // scroll to bottom
  setTimeout(() => window.scrollTo({
    top: document.documentElement.scrollHeight - window.innerHeight,
    left: 0,
    behavior: 'smooth',
  }), 10);
}

let __startTime = 0;
function timeStart() {
  __startTime = Date.now();
}
function timeEnd() {
  return Date.now() - __startTime;
}
```

--------------------------------

### Initialize Wllama with Suppressed Debug Logs (JavaScript)

Source: https://github.com/ngxson/wllama/blob/master/README.md

Shows how to initialize the Wllama instance with a custom logger that suppresses debug messages. This is achieved by using the predefined `LoggerWithoutDebug` class provided by the Wllama library.

```javascript
import { Wllama, LoggerWithoutDebug } from '@wllama/wllama';

const wllama = new Wllama(pathConfig, {
  // LoggerWithoutDebug is predefined inside wllama
  logger: LoggerWithoutDebug,
});
```

--------------------------------

### Split GGUF Model using llama-gguf-split

Source: https://github.com/ngxson/wllama/blob/master/README.md

Splits a large GGUF model file into smaller, manageable chunks to overcome ArrayBuffer size limitations and potentially speed up downloads. The `--split-max-size` option controls the maximum size of each chunk. Output files are sequentially numbered.

```bash
# Split the model into chunks of 512 Megabytes
./llama-gguf-split --split-max-size 512M ./my_model.gguf ./my_model
```

--------------------------------

### Model Manager: List, Download, and Remove Models with Wllama JavaScript

Source: https://context7.com/ngxson/wllama/llms.txt

This snippet demonstrates how to use the Wllama ModelManager to interact with cached language models.
It covers listing existing models, downloading new models with progress callbacks, validating model integrity, refreshing invalid models, and removing models or clearing the entire cache.

```javascript
import { Wllama } from '@wllama/wllama';

const wllama = new Wllama(CONFIG_PATHS);

// Get all cached models
const models = await wllama.modelManager.getModels();
for (const model of models) {
  console.log('URL:', model.url);
  console.log('Size:', model.size, 'bytes');
  console.log('Files:', model.files.length);
  console.log('Valid:', model.validate());
}

// Download a specific model
const model = await wllama.modelManager.downloadModel(
  'https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf',
  {
    progressCallback: ({ loaded, total }) => {
      console.log(`Progress: ${loaded}/${total}`);
    },
  }
);

// Validate model
const validationStatus = model.validate();
console.log('Model status:', validationStatus);

// Refresh invalid model
if (validationStatus !== 'valid') {
  await model.refresh();
}

// Remove model from cache
await model.remove();

// Clear all models
await wllama.modelManager.clear();
```
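
The cached-model fields used above (`url`, `size`, `validate()`) can be folded into a short cache summary. The following is a minimal sketch and not part of wllama: `summarizeCache` and `formatBytes` are hypothetical helper names, and the code assumes that `validate()` returns `'valid'` for an intact model, as the comparison in the Model Manager snippet suggests.

```javascript
// Hypothetical helper (not a wllama API): summarize a list of cached models.
// Assumes each entry exposes `url`, `size` (in bytes) and `validate()`,
// and that validate() returns the string 'valid' for an intact model.
function summarizeCache(models) {
  let totalBytes = 0;
  const invalidUrls = [];
  for (const model of models) {
    totalBytes += model.size;
    if (model.validate() !== 'valid') {
      invalidUrls.push(model.url);
    }
  }
  return { count: models.length, totalBytes, invalidUrls };
}

// Hypothetical helper: format a byte count using binary units.
function formatBytes(bytes) {
  const units = ['B', 'KiB', 'MiB', 'GiB'];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  // Whole numbers for plain bytes, one decimal for larger units
  return `${value.toFixed(i === 0 ? 0 : 1)} ${units[i]}`;
}
```

One possible use is to pass the result of `wllama.modelManager.getModels()` to `summarizeCache` and log `formatBytes(summary.totalBytes)` before deciding whether to call `wllama.modelManager.clear()`.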