### Setup Python Virtual Environment

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/model-conversion/README.md

Create and activate a Python virtual environment, then install dependencies from requirements.txt.

```console
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

--------------------------------

### Running Llama Server with Jinja Templates

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

Examples of starting the llama server with different models and Jinja templates for generic format support.

```bash
llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
```

```bash
llama-server --jinja -fa -hf bartowski/gemma-2-2b-it-GGUF:Q8_0
```

```bash
llama-server --jinja -fa -hf bartowski/c4ai-command-r-v01-GGUF:Q2_K
```

--------------------------------

### Start Development Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/ui/README.md

Start the Vite development server for the UI frontend.

```bash
npm run dev
```

--------------------------------

### Basic CMake Project Setup

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt

Sets the minimum CMake version and defines the project name and languages. This is a standard starting point for any CMake project.

```cmake
cmake_minimum_required(VERSION 3.19)
project("vulkan-shaders-gen" C CXX)
```

--------------------------------

### Set up LunarG Vulkan SDK Environment

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/build.md

Source the setup script to configure your environment for the LunarG Vulkan SDK on macOS. This is typically done after installation.

```bash
source /path/to/vulkan-sdk/setup-env.sh
```

--------------------------------

### Install Dependencies

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/ui/README.md

Navigate to the ui directory and install project dependencies using npm.

```bash
cd tools/ui
npm install
```

--------------------------------

### LLaDA Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the LLaDA architecture with block-based scheduling and visualization.

```bash
llama-diffusion-cli -m llada-8b.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-block-length 32 --diffusion-steps 256 --diffusion-visual
```

--------------------------------

### Install and Verify clinfo

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/SYCL.md

Install the 'clinfo' utility to verify GPU driver installation and list available OpenCL devices.

```shell
sudo apt install clinfo
sudo clinfo -l
```

--------------------------------

### Using DFlash and TurboQuant with BeeLlama.cpp Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/README.md

Example of configuring the BeeLlama.cpp server to use DFlash and TurboQuant for optimized inference. This setup is for advanced performance tuning.

```sh
llama-server -m target.gguf --spec-type dflash \
  --spec-draft-model drafter.gguf \
  --spec-draft-ngl all \
  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq
```

--------------------------------

### Install k6 and xk6-sse Extension

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/README.md

Build k6 with the xk6-sse extension to support SSE. Requires Go to be installed.

```shell
go install go.k6.io/xk6/cmd/xk6@latest
$GOPATH/bin/xk6 build master \
--with github.com/phymbert/xk6-sse
```

--------------------------------

### Install Executable

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/training/CMakeLists.txt

Installs the 'llama-finetune' target, making its runtime available.

```cmake
install(TARGETS ${TARGET} RUNTIME)
```

--------------------------------

### Run Retrieval Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/retrieval/README.md

Execute the retrieval example with specified model, context files, chunk size, and separator.

```bash
llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
```

--------------------------------

### Install SPEED-Bench Client

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/speed-bench/README.md

Install the necessary Python packages for the SPEED-Bench client. Ensure you are in the correct directory.

```bash
pip install -r tools/server/bench/speed-bench/requirements.txt
```

--------------------------------

### Model Preset Configuration Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Example .ini file demonstrating how to define global and model-specific configurations for llama-server presets.

```ini
version = 1

; (Optional) This section provides global settings shared across all presets.
; If the same key is defined in a specific preset, it will override the value in this global section.
[*]
c = 8192
n-gpu-layers = 8

; If the key corresponds to an existing model on the server,
; this will be used as the default config for that model
[ggml-org/MY-MODEL-GGUF:Q8_0]
; string value
chat-template = chatml
; numeric value
n-gpu-layers = 123
; flag value (for certain flags, you need to use the "no-" prefix for negation)
jinja = true
; shorthand argument (for example, context size)
c = 4096
; environment variable name
LLAMA_ARG_CACHE_RAM = 0
; file paths are relative to server's CWD
model-draft = ./my-models/draft.gguf
; but it's RECOMMENDED to use absolute path
model-draft = /Users/abc/my-models/draft.gguf

; If the key does NOT correspond to an existing model, 
; you need to specify at least the model path or HF repo
[custom_model]
model = /Users/abc/my-awesome-model-Q4_K_M.gguf
```

--------------------------------

### Install HTP Library

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-hexagon/htp/CMakeLists.txt

Installs the built HTP library.

```cmake
install(TARGETS ${HTP_LIB})
```

--------------------------------

### Dream Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the Dream architecture with specified diffusion parameters and visualization enabled.

```bash
llama-diffusion-cli -m dream7b.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-eps 0.001 --diffusion-algorithm 3 --diffusion-steps 256 --diffusion-visual
```

--------------------------------

### Start Embedding Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/model-conversion/README.md

Starts the embedding server for model verification. Ensure the virtual environment is activated.

```console
(venv) $ make embedding-start-embedding-server

```

--------------------------------

### RND1 Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the RND1 architecture with specific diffusion algorithm, sampling temperature, and epsilon.

```bash
llama-diffusion-cli -m RND1-Base-0910.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-algorithm 1 --diffusion-steps 256 --diffusion-visual --temp 0.5 --diffusion-eps 0.001
```

--------------------------------

### Install Build Tools on Windows

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/OPENVINO.md

Installs Git, Wget, and Ninja build tools on Windows using winget.

```powershell
# Windows PowerShell
winget install Git.Git
winget install GNU.Wget
winget install Ninja-build.Ninja
```

--------------------------------

### Build and Install llama.cpp for Android

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/android.md

After configuration, use these commands to build the project in release mode and install it to a specified directory.

```bash
cmake --build build-android --config Release -j{n}
```

```bash
cmake --install build-android --prefix {install-dir} --config Release
```

--------------------------------

### Run MobileVLM CLI Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/multimodal/MobileVLM.md

Example of how to run the MobileVLM command-line interface with a specified model, mmproj, and chat template.

```sh
./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --chat-template deepseek
```

--------------------------------

### Run Interactive Chat Example (Bash)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Command to run the interactive chat example using Bash, curl, and jq.

```sh
bash chat.sh
```

--------------------------------

### Start LLM Model Server for TTS

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/tts/README.md

Starts a llama-server instance to serve the OuteTTS LLM model on port 8020. This is part of running the TTS example with llama-server.

```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

--------------------------------

### Install NDK and OpenCL Headers/Library for Android

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/OPENCL.md

Installs the Android NDK and clones/builds OpenCL headers and ICD loader for Android development. Ensure the NDK path and version match your setup.

```shell
cd ~
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip

yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"

```

```shell
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android

```

--------------------------------

### Basic Usage Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/batched-bench/README.md

Run the benchmark with specified model, context size, batch sizes, and prompt/generation token counts. This example shows a custom set of batches.

```bash
./llama-batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]
```

--------------------------------

### Complete VirtGPU Configuration Example (macOS Metal)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/VirtGPU/configuration.md

This example shows how to configure VirtGPU for a macOS host using the Metal backend. It sets environment variables for the hypervisor, backend, and optional logging.

```bash
# Hypervisor environment
export VIRGL_APIR_BACKEND_LIBRARY="/opt/llama.cpp/lib/libggml-virtgpu-backend.dylib"

# Backend configuration
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib"
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg"

# Optional logging
export VIRGL_APIR_LOG_TO_FILE="/tmp/apir.log"
export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml.log"

# Guest configuration
export GGML_REMOTING_USE_APIR_CAPSET=1
```

--------------------------------

### Execute SYCL Example Script

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/SYCL.md

Run SYCL examples using a provided script. Supports selecting a single device or using multiple devices automatically.

```sh
./examples/sycl/test.sh -mg 0
```

```sh
./examples/sycl/test.sh
```

--------------------------------

### Web UI Development: Run Dev Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README-dev.md

Starts the development server for the Web UI, enabling hot reloading for rapid development. This command should be run after installing dependencies.

```sh
# run dev server (with hot reload)
npm run dev
```

--------------------------------

### GET /props

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Retrieves the server's global properties. By default, this endpoint is read-only. To enable modifications via POST requests, the server must be started with the `--props` flag.

```APIDOC
## GET /props: Get server global properties.

### Description
Retrieves the server's global properties. This endpoint is read-only by default. To enable modifications via POST requests, the server must be started with the `--props` flag.

### Method
GET

### Endpoint
/props

### Response
#### Success Response (200)
- **default_generation_settings** (object) - The default generation settings for the `/completion` endpoint.
- **total_slots** (integer) - The total number of slots for processing requests.
- **model_path** (string) - The path to the model file.
- **chat_template** (string) - The model's original Jinja2 prompt template.
- **chat_template_caps** (object) - Capabilities of the chat template.
- **modalities** (object) - The list of supported modalities.
- **media_marker** (string) - A media marker string.
- **build_info** (string) - Build information of the server.
- **is_sleeping** (boolean) - Sleeping status of the server.

#### Response Example
```json
{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "...",
  "chat_template_caps": {},
  "modalities": {
    "vision": false
  },
  "media_marker": "<__media_YoNhud46VdDqbuFmKYEO9PY7A4ARzRfg__>",
  "build_info": "b(build number)-(build commit hash)",
  "is_sleeping": false
}
```
```

--------------------------------

### Prepare Visual Encoder Directory

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/multimodal/granitevision.md

Create a directory for visual components and copy the llava.clip and llava.projector files into it.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

--------------------------------

### Get Server Global Properties

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Retrieves the current server global properties. This is a read-only operation by default. To enable POST requests for changing properties, start the server with the `--props` flag.

```json
{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "...",
  "chat_template_caps": {},
  "modalities": {
    "vision": false
  },
  "media_marker": "<__media_YoNhud46VdDqbuFmKYEO9PY7A4ARzRfg__>",
  "build_info": "b(build number)-(build commit hash)",
  "is_sleeping": false
}

```

--------------------------------

### Start llama-server for Benchmarking

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/speed-bench/README.md

Launch the llama-server with specific configurations for benchmarking. Match the client's concurrency to the server's slot count for accurate throughput measurements.

```bash
llama-server \
  -m target.gguf \
  -c 8192 \
  --port 8080 \
  -ngl 99 -fa on \
  --np 1 \
  --jinja
```

--------------------------------

### Start Server with Hermes Template Override

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

When running Hermes models, use the `--chat-template-file` argument to specify the correct Jinja template for tool interaction, ensuring proper function calling setup.

```shell
llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
    --chat-template-file models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
```

```shell
llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
    --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
```

--------------------------------

### Docker Compose with Environment Variables

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Example of configuring the llama.cpp server using Docker Compose and environment variables. This setup specifies the model path, context size, parallel processing, metrics endpoint, and server port.

```yaml
services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
```

--------------------------------

### Setup Build Context for Debugging

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/development/debugging-tests.md

Prepare a clean build directory for debugging. This involves removing any existing build directory and creating a new one.

```bash
rm -rf build-ci-debug && mkdir build-ci-debug && cd build-ci-debug
```

--------------------------------

### CMakeLists.txt for GGUF Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/gguf/CMakeLists.txt

This CMakeLists.txt file configures the build for a C++ executable named 'llama-gguf'. It specifies the source file, installation target, and links against the 'ggml' library and threading support. It also sets the C++ standard to C++17.

```cmake
set(TARGET llama-gguf)
add_executable(${TARGET} gguf.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE ggml ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

--------------------------------

### Example Slot Status with 2 Slots

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Shows the current processing state for two available slots, including task ID, context size, processing status, and sampling parameters. This response is returned by the GET /slots endpoint.

```json
[{"id":0,"id_task":135,"n_ctx":65536,"speculative":false,"is_processing":true,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"chat_format":"GPT-OSS","reasoning_format":"none","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0}},{"id":1,"id_task":0,"n_ctx":65536,"speculative":false,"is_processing":true,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"chat_format":"GPT-OSS","reasoning_format":"none","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"next_token":{"has_next_token":true,"has_new_line":true,"n_remain":-1,"n_decoded":136}}]
```

--------------------------------

### Vulkan Shader Generation Setup

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-vulkan/CMakeLists.txt

Sets up variables and finds necessary executables for Vulkan shader generation. This includes defining output directories, shader input paths, and the command for the shader generator tool.

```cmake
set (_ggml_vk_host_suffix $<IF:$<STREQUAL:${CMAKE_HOST_SYSTEM_NAME},Windows>,.exe,>)
set (_ggml_vk_genshaders_dir "${CMAKE_BINARY_DIR}/$<CONFIG>")
set (_ggml_vk_genshaders_cmd "${_ggml_vk_genshaders_dir}/vulkan-shaders-gen${_ggml_vk_host_suffix}")
set (_ggml_vk_header     "${CMAKE_CURRENT_BINARY_DIR}/ggml-vulkan-shaders.hpp")
set (_ggml_vk_input_dir  "${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders")
set (_ggml_vk_output_dir "${CMAKE_CURRENT_BINARY_DIR}/vulkan-shaders.spv")

file(GLOB _ggml_vk_shader_files CONFIGURE_DEPENDS "${_ggml_vk_input_dir}/*.comp")

# Because external projects do not provide source-level tracking,
# the vulkan-shaders-gen sources need to be explicitly added to
# ensure that changes will cascade into shader re-generation.

file(GLOB _ggml_vk_shaders_gen_sources
          CONFIGURE_DEPENDS "${_ggml_vk_input_dir}/*.cpp"
                           "${_ggml_vk_input_dir}/*.h")
```

--------------------------------

### Run Completion with Multiple Hexagon NPU Sessions

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/snapdragon/README.md

Execute a summary request using the OLMoE-1B-7B model, requiring two Hexagon NPU sessions (HTP0, HTP1) due to its size. This example demonstrates multi-session setup and provides performance and memory breakdown.

```bash
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt
... 
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
... 
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   143.86 MiB
load_tensors:         HTP1 model buffer size =     0.23 MiB
load_tensors:  HTP1-REPACK model buffer size =  1575.00 MiB
load_tensors:         HTP0 model buffer size =     0.28 MiB
load_tensors:  HTP0-REPACK model buffer size =  2025.00 MiB
... 
llama_context:        CPU  output buffer size =     0.19 MiB
llama_kv_cache:       HTP1 KV buffer size =   238.00 MiB
llama_kv_cache:       HTP0 KV buffer size =   306.00 MiB
llama_kv_cache: size =  544.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):  272.00 MiB, V (q8_0):  272.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:       HTP1 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    24.56 MiB
... 
llama_perf_context_print: prompt eval time =    1730.57 ms /   212 tokens (    8.16 ms per token,   122.50 tokens per second)
llama_perf_context_print:        eval time =    5624.75 ms /   257 runs   (   21.89 ms per token,    45.69 tokens per second)
llama_perf_context_print:       total time =    7377.33 ms /   469 tokens
llama_perf_context_print:    graphs reused =        255
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP1 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  742 =   144 +     544 +      54                |
llama_memory_breakdown_print: |   - HTP1-REPACK        |                 1575 =  1575 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                 2025 =  2025 +       0 +       0                |
```

--------------------------------

### Install Hexagon Skels

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-hexagon/CMakeLists.txt

Installs Hexagon skels required at runtime. This is a basic installation command.

```cmake
install(FILES ${HTP_SKELS} TYPE LIB)
```

--------------------------------

### Install and Set Up Environment on Target Device

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/snapdragon/linux.md

Transfer the zipped package to the target Linux device, unzip it, and set the LD_LIBRARY_PATH and ADSP_LIBRARY_PATH environment variables to include the libraries.

```bash
$ unzip pkg-snapdragon.zip
$ cd pkg-snapdragon
$ export LD_LIBRARY_PATH=./lib
$ export ADSP_LIBRARY_PATH=./lib
```

--------------------------------

### OpenAI-Compatible Server Commands for BeeLlama.cpp

Source: https://github.com/anbeeld/beellama.cpp/blob/main/README.md

Shows how to start the OpenAI-compatible server for BeeLlama.cpp. Use these commands to expose models via an API.

```sh
llama-server -m model.gguf --port 8080
```

```sh
llama-server -m model.gguf -c 16384 -np 4
```

```sh
llama-server -m model.gguf -md draft.gguf
```

--------------------------------

### Install gguf Python Package

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

Install the gguf package using pip. This is the basic installation for using the package's functionalities.

```sh
pip install gguf
```

--------------------------------

### Web UI Development: Install Dependencies

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README-dev.md

Installs the necessary Node.js dependencies for the Web UI development server. Ensure Node.js is installed before running.

```sh
# make sure you have Node.js installed
cd tools/ui
npm i
```

--------------------------------

### Quick Start llama-server on Windows

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Start the llama-server on Windows. Use the .exe executable and specify the correct path for the model file and context size.

```powershell
llama-server.exe -m models\7B\ggml-model.gguf -c 2048
```

--------------------------------

### Start Server with Native Tool-Aware Jinja Support

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

Use this command to start a server with native support for tool-aware Jinja templates. Ensure the model specified has a compatible chat template.

```shell
llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
```

```shell
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
```

```shell
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
```

```shell
llama-server --jinja -fa -hf bartowski/granite-4.1-3b-GGUF:Q4_K_M
```

--------------------------------

### Verify CUDA Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/CUDA-FEDORA.md

Checks the installed version of the NVIDIA CUDA Compiler (nvcc). This command confirms that CUDA is correctly installed and accessible in the PATH.

```bash
nvcc --version
```

--------------------------------

### Upgrade Pip for Editable Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

If an editable installation requires an upgrade to Pip, use this command to install the latest version. This ensures compatibility with modern packaging standards.

```sh
pip install --upgrade pip
```

--------------------------------

### Example BeeLlama DFlash Launch Script

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/beellama-args.md

This example demonstrates a typical DFlash launch script configuration for BeeLlama, including model paths, speculative decoding settings, context sizes, KV cache precision, and logging options. It serves as a template for setting up a high-performance BeeLlama server.

```shell
llama-server \
  -m "path/to/target.gguf" \
  --mmproj "path/to/mmproj.gguf" \
  --no-mmproj-offload \
  --spec-draft-model "path/to/drafter.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 8082 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 102400 \
  --cache-type-k q5_0 --cache-type-v q4_0 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host --metrics \
  --log-timestamps --log-prefix --log-colors off \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 --top-k 20 --min-p 0.0
```

--------------------------------

### Install OpenCL Headers for Windows Arm64

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/build.md

Installs OpenCL headers and the ICD loader for Windows Arm64. CMake is used with Ninja generator, and installation paths are specified.

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja \
  -DBUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

--------------------------------

### Custom Batch Configuration Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/batched-bench/README.md

Run the benchmark with a specific set of prompt tokens per batch, tokens per generation, and number of parallel batches.

```bash
./llama-batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 2048 -b 512 -ub 512 -ngl 999 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
```

--------------------------------

### Install llama.cpp via Nix (Non-flake-enabled)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/install.md

Use this command to install llama.cpp on Mac and Linux for non-flake-enabled Nix installations. This expression is automatically updated within the nixpkgs repository.

```sh
nix-env --file '<nixpkgs>' --install --attr llama-cpp
```

--------------------------------

### Install llama.cpp via Nix (Flake-enabled)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/install.md

Use this command to install llama.cpp on Mac and Linux for flake-enabled Nix installations. This expression is automatically updated within the nixpkgs repository.

```sh
nix profile install nixpkgs#llama-cpp
```

--------------------------------

### Install Build and Twine for Manual Publishing

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

Install the 'build' and 'twine' packages, which are necessary for manually building and uploading the Python package to PyPI.

```sh
pip install build twine
```

--------------------------------

### Apply Multiple Scaled LoRA Adapters

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/export-lora/README.md

Example demonstrating how to apply multiple LoRA adapters with custom scaling factors to a base model.

```bash
./bin/llama-export-lora \
    -m your_base_model.gguf \
    -o your_merged_model.gguf \
    --lora-scaled lora_task_A.gguf 0.5 \
    --lora-scaled lora_task_B.gguf 0.5

```

--------------------------------

### Conditional Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/app/CMakeLists.txt

Installs the 'llama-app' target if the LLAMA_TOOLS_INSTALL build option is enabled.

```cmake
if(LLAMA_TOOLS_INSTALL)
    install(TARGETS ${TARGET} RUNTIME)
endif()
```