### Installing k6 with SSE Extension (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/bench/README.md

This snippet demonstrates how to install the k6 load testing tool and build it with the `xk6-sse` extension, which provides Server-Sent Events support. This is necessary because SSE is not included in the default k6 build. It uses `go install` to get the `xk6` tool and then `xk6 build` to compile k6 with the specified extension.

```shell
go install go.k6.io/xk6/cmd/xk6@latest
xk6 build master \
--with github.com/phymbert/xk6-sse
```

--------------------------------

### Installing LLaVA Python Dependencies (General)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llava/README.md

This command installs all necessary Python packages listed in `examples/llava/requirements.txt`. These dependencies are crucial for running the LLaVA model processing and conversion scripts.

```sh
pip install -r examples/llava/requirements.txt
```

--------------------------------

### Running SimpleChat via Llama Server (Quickstart) - Shell

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/public_simplechat/readme.md

This command launches the `llama-server` executable, loading a specified GGUF model and serving the SimpleChat web frontend from the `examples/server/public_simplechat` directory. It provides a quick way to get the server and UI running for immediate testing.

```Shell
bin/llama-server -m path/model.gguf --path ../examples/server/public_simplechat
```

--------------------------------

### Starting Interactive Conversation Mode (Unix Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main/README.md

Launches `llama-cli` in continuous conversation mode on Unix-based systems. It loads the `gemma-1.1-7b-it.Q4_K_M.gguf` model and applies the 'gemma' chat template, enabling ongoing user interaction with the model.

```Bash
./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf -cnv --chat-template gemma
```

--------------------------------

### Starting Interactive Conversation Mode (Windows PowerShell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main/README.md

Launches `llama-cli.exe` in continuous conversation mode on Windows. It loads the `gemma-1.1-7b-it.Q4_K_M.gguf` model and applies the 'gemma' chat template, enabling ongoing user interaction with the model.

```PowerShell
./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf -cnv --chat-template gemma
```

--------------------------------

### Setting up User for Ascend Driver Installation

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

These commands create a new user group 'HwHiAiUser' and a new user 'HwHiAiUser' with a home directory, then add the current user to this newly created group. This setup is a prerequisite for installing the Ascend driver.

```sh
sudo groupadd -g HwHiAiUser
sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
sudo usermod -aG HwHiAiUser $USER
```

--------------------------------

### Executing SYCL Device Listing Tool (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/sycl/README.md

This command executes the `llama-ls-sycl-device` tool, which is designed to list all available SYCL devices. It provides detailed information such as device ID, type, name, version, compute units, max work group size, and global memory size, useful for verifying SYCL setup.

```Bash
./build/bin/llama-ls-sycl-device
```

--------------------------------

### Running llama.cpp Retrieval Example (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/retrieval/README.md

This command compiles the `llama.cpp` project and then executes the `llama-retrieval` example. It specifies a pre-trained model, sets the number of top similar chunks to retrieve (`--top-k 3`), and defines the input context files (`README.md`, `License`). It also configures chunking parameters (`--chunk-size 100`) and the chunk separator (`.`).

```bash
make -j && ./llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
```

--------------------------------

### Installing Ascend NPU Firmware

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

This command executes the Ascend NPU firmware installer script with the full installation option. Successful execution is indicated by a confirmation message, ensuring the firmware is correctly installed for NPU operation.

```sh
sudo sh Ascend-hdk-910b-npu-firmware_x.x.x.x.X.run --full
```

--------------------------------

### Installing Ascend NPU Driver

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

This command executes the Ascend NPU driver installer script with full installation options for all users. Users must download the appropriate driver version for their system from the official Ascend website before running this command.

```sh
sudo sh Ascend-hdk-910b-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all
```

--------------------------------

### Running Interactive Chat Sample (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/README.md

This shell command executes the chat.sh script using Bash, providing another interactive chat example. It depends on bash, curl, and jq being installed.

```sh
bash chat.sh
```

--------------------------------

### Starting llama-server on Windows

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/README.md

This command starts the `llama-server` executable on Windows. It loads the specified GGUF model (`-m`) and sets the context size (`-c`). The server defaults to listening on `127.0.0.1:8080`.

```powershell
llama-server.exe -m models\7B\ggml-model.gguf -c 2048
```

--------------------------------

### Running Llama.cpp Eval Callback Example (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/eval-callback/README.md

This shell command demonstrates how to execute the `llama-eval-callback` example. It specifies the Hugging Face repository and file for the model, the local model filename, an initial prompt, a seed for reproducibility, and offloads 33 layers to the GPU for accelerated inference.

```shell
llama-eval-callback \
  --hf-repo ggml-org/models \
  --hf-file phi-2/ggml-model-q4_0.gguf \
  --model phi-2-q4_0.gguf \
  --prompt hello \
  --seed 42 \
  -ngl 33
```

--------------------------------

### Verifying Ascend Driver Installation

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

This command checks the status of the Ascend NPU driver and devices after installation. The output provides details such as NPU health, power consumption, temperature, and memory usage, confirming successful driver setup.

```sh
npu-smi info
```

--------------------------------

### Installing Vulkan SDK on Ubuntu 22.04

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/build.md

This sequence of commands adds the LunarG Vulkan repository to Ubuntu 22.04, updates package lists, and installs the `vulkan-sdk` package. The final command `vulkaninfo` is used to verify the successful installation of the SDK.

```bash
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
apt update -y
apt-get install -y vulkan-sdk
# To verify the installation, use the command below:
vulkaninfo
```

--------------------------------

### Generating Text with a Single Prompt (Unix Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main/README.md

Executes `llama-cli` on Unix-based systems to generate text from a specified LLaMA model using a direct command-line prompt. This 'one-and-done' mode loads `gemma-1.1-7b-it.Q4_K_M.gguf` and starts text generation with 'Once upon a time'.

```Bash
./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --prompt "Once upon a time"
```

--------------------------------

### Example: Running GDB with Specific Test and Model (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/development/debugging-tests.md

This concrete example demonstrates how to initiate a GDB session for a specific test. It directly provides the path to the test binary and its associated GGUF model file, as identified from the `ctest -N` output.

```Bash
gdb --args ~/llama.cpp/build-ci-debug/bin/test-tokenizer-0 "~/llama.cpp/tests/../models/ggml-vocab-llama-spm.gguf"
```

--------------------------------

### Example Workflow: Generating and Applying Importance Matrix for Quantization

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/imatrix/README.md

This example demonstrates a full workflow: first, compiling `llama.cpp` with CUDA support, then generating an importance matrix (`imatrix.dat`) using `llama-imatrix` with GPU offloading, and finally, applying this matrix during a Q4_K_M quantization of the model using `llama-quantize`.

```Bash
GGML_CUDA=1 make -j

# generate importance matrix (imatrix.dat)
./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
```

--------------------------------

### Starting RPC Server (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/rpc/README.md

This command starts the `rpc-server` executable, listening on port 50052. It initializes the CUDA backend and displays information about detected CUDA devices, indicating the server is ready to accept connections.

```bash
bin/rpc-server -p 50052
```

--------------------------------

### Starting llama-server on Unix-based Systems

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/README.md

This command starts the `llama-server` on Unix-based systems. It loads the specified GGUF model (`-m`) and sets the context size (`-c`). The server defaults to listening on `127.0.0.1:8080`.

```bash
./llama-server -m models/7B/ggml-model.gguf -c 2048
```

--------------------------------

### Conditionally Adding Examples and POCs Subdirectories - CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/CMakeLists.txt

This snippet conditionally includes the `examples` and `pocs` (Proof of Concepts) subdirectories in the build. It checks if the `LLAMA_BUILD_EXAMPLES` CMake variable is true. This allows for building example applications and experimental code only when desired, reducing build time for core library development.

```CMake
if (LLAMA_BUILD_EXAMPLES)
    add_subdirectory(examples)
    add_subdirectory(pocs)
endif()
```

--------------------------------

### Installing Python Packaging Tools (Build, Twine)

Source: https://github.com/lizonghang/prima.cpp/blob/main/gguf-py/README.md

Installs `build` and `twine`, essential Python packages for building distribution archives and uploading them to package indexes like PyPI, respectively.

```sh
pip install build twine
```

--------------------------------

### Defining and Installing Executable in CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/gguf-hash/CMakeLists.txt

This snippet defines the `llama-gguf-hash` executable from `gguf-hash.cpp` and configures it for installation. The `set` command defines a variable `TARGET` for the executable name, `add_executable` creates the target, and `install` ensures it's placed in the runtime directory during installation.

```CMake
set(TARGET llama-gguf-hash)
add_executable(${TARGET} gguf-hash.cpp)
install(TARGETS ${TARGET} RUNTIME)
```

--------------------------------

### Starting prima.cpp in Server Mode (Rank 1 Client)

Source: https://github.com/lizonghang/prima.cpp/blob/main/README.md

This command launches 'llama-cli' on a non-rank 0 device (rank 1 in this example) to connect to the 'llama-server' running on the rank 0 device. It specifies the model, world size, rank, and master/next IP addresses, enabling distributed inference in a server-client configuration.

```shell
# On rank 1, run:
./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --master 192.168.1.2 --next 192.168.1.2 --prefetch
```

--------------------------------

### Verifying SYCL Device Installation (Windows Command Prompt)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/SYCL.md

Executing sycl-ls.exe in the oneAPI command line lists all available SYCL devices. This command is used to confirm that Intel GPU devices are correctly detected as [ext_oneapi_level_zero:gpu], indicating a successful SYCL and Level-Zero driver setup.

```cmd
sycl-ls.exe
```

--------------------------------

### Running GritLM Example for Similarity and Text Generation

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/gritlm/README.md

This command executes the `llama-gritlm` program, loading the previously downloaded `gritlm-7b_q4_1.gguf` model. The output demonstrates both cosine similarity calculations between different phrases and a generated poetic text, showcasing the model's dual capabilities in embedding generation and creative text generation.

```console
./llama-gritlm -m models/gritlm-7b_q4_1.gguf

Cosine similarity between "Bitcoin: A Peer-to-Peer Electronic Cash System" and "A purely peer-to-peer version of electronic cash w" is: 0.605
Cosine similarity between "Bitcoin: A Peer-to-Peer Electronic Cash System" and "All text-based language problems can be reduced to" is: 0.103
Cosine similarity between "Generative Representational Instruction Tuning" and "A purely peer-to-peer version of electronic cash w" is: 0.112
Cosine similarity between "Generative Representational Instruction Tuning" and "All text-based language problems can be reduced to" is: 0.547

Oh, brave adventurer, who dared to climb
The lofty peak of Mt. Fuji in the night,
When shadows lurk and ghosts do roam,
And darkness reigns, a fearsome sight.

Thou didst set out, with heart aglow,
To conquer this mountain, so high,
And reach the summit, where the stars do glow,
And the moon shines bright, up in the sky.

Through the mist and fog, thou didst press on,
With steadfast courage, and a steadfast will,
Through the darkness, thou didst not be gone,
But didst climb on, with a steadfast skill.

At last, thou didst reach the summit's crest,
And gazed upon the world below,
And saw the beauty of the night's best,
And felt the peace, that only nature knows.

Oh, brave adventurer, who dared to climb
The lofty peak of Mt. Fuji in the night,
Thou art a hero, in the eyes of all,
For thou didst conquer this mountain, so bright.
```

--------------------------------

### Compiling BLIS from Source

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/BLIS.md

This snippet outlines the steps to clone, configure, and compile the BLIS library from its GitHub repository. It enables CBLAS compatibility and configures multithreading using OpenMP and pthreads, installing to /usr/local/ by default.

```bash
git clone https://github.com/flame/blis
cd blis
./configure --enable-cblas -t openmp,pthreads auto
# will install to /usr/local/ by default.
make -j
```

--------------------------------

### Installing Ascend CANN Toolkit and Kernels

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

These commands install the Ascend CANN toolkit and its corresponding kernels. The CANN toolkit provides the core software stack for developing and running AI applications on Ascend devices, requiring a minimum version of 8.0.RC2.alpha002.

```sh
sh Ascend-cann-toolkit_8.0.RC2.alpha002_linux-aarch64.run --install
sh Ascend-cann-kernels-910b_8.0.RC2.alpha002_linux.run --install
```

--------------------------------

### Starting prima.cpp in Server Mode (Rank 0)

Source: https://github.com/lizonghang/prima.cpp/blob/main/README.md

This command launches the 'llama-server' on the rank 0 device, enabling prima.cpp to operate in server mode. It configures the server with the model path, context size, world size, rank, master/next IP addresses, and specifies the host and port for API access. This setup allows other devices to connect as clients for distributed inference.

```shell
# On rank 0, run:
./llama-server -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 2 --rank 0 --master 192.168.1.2 --next 192.168.1.3 --prefetch --host 127.0.0.1 --port 8080
```

--------------------------------

### Installing Core Prerequisites on Linux

Source: https://github.com/lizonghang/prima.cpp/blob/main/README.md

This snippet provides a shell command to install essential build tools and libraries required for the project on Ubuntu or other Debian-based Linux distributions. It includes `gcc`, `make`, `cmake`, `fio`, `git`, `wget`, and `libzmq3-dev`.

```shell
sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev
```

--------------------------------

### Starting Server Tests

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/tests/README.md

This command executes the main test script, initiating the server tests. It's the primary way to run the BDD scenarios after the server has been built.

```Shell
./tests.sh
```

--------------------------------

### Installing Core Prerequisites on macOS

Source: https://github.com/lizonghang/prima.cpp/blob/main/README.md

This snippet provides a Homebrew command to install the necessary dependencies for the project on macOS. It includes `gcc`, `make`, `cmake`, `fio`, `git`, `wget`, `highs`, and `zeromq`.

```shell
brew install gcc make cmake fio git wget highs zeromq
```

--------------------------------

### Building External Project Using Installed llama.cpp CMake Package (CMD)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main-cmake-pkg/README.md

This snippet demonstrates how to build an external project, `llama-cli-cmake-pkg`, by referencing the previously installed llama.cpp CMake package. It configures the build using `CMAKE_PREFIX_PATH` to locate the Llama package, compiles the project, and installs the resulting application to `C:/MyLlamaApp`.

```cmd
cd ..\examples\main-cmake-pkg
cmake -B build -DBUILD_SHARED_LIBS=OFF -DCMAKE_PREFIX_PATH="C:/LlamaCPP/lib/cmake/Llama" -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/MyLlamaApp
```

--------------------------------

### Installing Python Test Dependencies

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/tests/README.md

This command installs the necessary Python packages for running the server tests. It reads dependencies from the `requirements.txt` file, ensuring all required libraries like aiohttp and asyncio are available.

```Shell
pip install -r requirements.txt
```

--------------------------------

### Installing the GGUF Python Package

Source: https://github.com/lizonghang/prima.cpp/blob/main/gguf-py/README.md

Installs the `gguf` Python package using pip, making it available for use in Python projects. This is the standard way to get the package.

```sh
pip install gguf
```

--------------------------------

### Running Automated Benchmark with Python CI Script (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/bench/README.md

This command executes the `bench.py` Python script, designed for CI/CD environments, to automate the entire benchmarking process. It sets the path to the `llama-server` binary and configures various benchmark parameters, including runner label, name, Git branch/commit, k6 scenario, duration, Hugging Face model details, and server-specific settings like parallel processing, batch sizes, context size, and token limits. This script streamlines server startup, k6 execution, and metric extraction.

```shell
LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/llama-server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models\t \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size\t256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256
```

--------------------------------

### Executing llama.cpp for Text Generation (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/simple/README.md

This `bash` command runs the `llama-simple` executable, a basic example from `llama.cpp`, to perform text generation. It requires specifying the path to a GGUF model file using the `-m` flag and the initial prompt string using the `-p` flag. The command demonstrates how to initiate a text generation task and provides an example of the expected output, including performance timings.

```bash
./llama-simple -m ./models/llama-7b-v2/ggml-model-f16.gguf -p "Hello my name is"
```

--------------------------------

### Installing Python Dependencies for CANN Toolkit

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

This command installs a list of Python packages using pip3. These packages are essential dependencies for the Ascend CANN toolkit, ensuring its Python components function correctly.

```sh
pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
```

--------------------------------

### Setting up Node.js Client Directory

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/README.md

These commands create a new directory named `llama-client` and then navigate into it. This prepares the environment for creating a Node.js script to interact with the `llama-server`.

```bash
mkdir llama-client
cd llama-client
```

--------------------------------

### Creating Executable, Linking Libraries, and Installing - CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main-cmake-pkg/CMakeLists.txt

This final section creates the main executable using `main.cpp`, includes the common utility path, and specifies the installation of the executable for runtime. It then links the `common` object library, the `llama` library, and thread libraries, and sets the C++ standard to C++11 for the executable target.

```CMake
add_executable(${TARGET} ${CMAKE_CURRENT_LIST_DIR}/../main/main.cpp)
target_include_directories(${TARGET} PRIVATE ${_common_path})
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
```

--------------------------------

### Running Llama Passkey Example in Bash

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/passkey/README.md

This command compiles the `llama.cpp` project using `make -j` for parallel compilation, then executes the `llama-passkey` example. It specifies a model file (`ggml-model-f16.gguf`) using the `-m` flag and includes a `--junk` parameter with a value of 250, likely used to add irrelevant data to test the model's long context recall capabilities.

```bash
make -j && ./llama-passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
```

--------------------------------

### Building and Installing llama-quantize-stats Executable with CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/quantize-stats/CMakeLists.txt

This CMake snippet defines the `llama-quantize-stats` executable, specifies its source file (`quantize-stats.cpp`), sets it for installation, links it against `llama` and `build_info` libraries, includes common directories, and enforces C++11 standard compliance. It ensures the executable is properly built and made available.

```CMake
set(TARGET llama-quantize-stats)
add_executable(${TARGET} quantize-stats.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama build_info ${CMAKE_THREAD_LIBS_INIT})
target_include_directories(${TARGET} PRIVATE ../../common)
target_compile_features(${TARGET} PRIVATE cxx_std_11)
```

--------------------------------

### Generating Text with a Single Prompt (Windows PowerShell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/main/README.md

Executes `llama-cli.exe` on Windows to generate text from a specified LLaMA model using a direct command-line prompt. This 'one-and-done' mode loads `gemma-1.1-7b-it.Q4_K_M.gguf` and starts text generation with 'Once upon a time'.

```PowerShell
./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --prompt "Once upon a time"
```

--------------------------------

### Command-line Usage Syntax for llama-bench

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llama-bench/README.md

This snippet outlines the complete command-line syntax for the `llama-bench` performance testing tool. It lists all available options, their default values, and describes parameters for model selection, prompt and generation lengths, batching, thread management, GPU layer offloading, and various output formats. It also highlights that most options can accept multiple comma-separated values or be specified multiple times to run comparative tests.

```Shell
usage: ./llama-bench [options]

options:
  -h, --help
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -t, --threads <n>                         (default: 8)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -rpc, --rpc <rpc_servers>                 (default: )
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -mmp, --mmap <0|1>                        (default: 1)
  --numa <distribute|isolate|numactl>       (default: disabled)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -r, --repetitions <n>                     (default: 5)
  --prio <0|1|2|3>                          (default: 0)
  --delay <0...N> (seconds)                 (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> (default: none)
  -v, --verbose                             (default: 0)
```

--------------------------------

### Sample Output of clinfo -l

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/SYCL.md

This snippet shows an example of the expected output when executing `clinfo -l` after a successful Intel GPU driver installation. It displays the detected Intel OpenCL Graphics platforms and their associated devices, such as Intel Arc and Iris Xe Graphics.

```sh
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```

--------------------------------

### Quantizing LLaMA Models: Setup and Conversion (Bash)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/quantize/README.md

This snippet provides a sequence of bash commands to set up the environment, convert Hugging Face models to GGUF FP16 format, and quantize them to 4-bits using the Q4_K_M method. It includes optional steps for different tokenizer types and updating GGUF file versions. Dependencies include Python and `llama.cpp` tools.

```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```

--------------------------------

### Cloning llama.cpp Repository

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llava/README-minicpmv2.5.md

This snippet provides the commands to clone the llama.cpp repository from GitHub and navigate into its directory. This is the initial step required to set up the build environment for llama.cpp.

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

--------------------------------

### Defining a Root Rule for Lists in GBNF

Source: https://github.com/lizonghang/prima.cpp/blob/main/grammars/README.md

This GBNF snippet defines a `root` rule for a list format, where each item starts with `"- "` followed by the content of an `item` and then a newline. The `item` rule itself matches any characters until a newline. The `+` quantifier ensures that the list must contain one or more items.

```GBNF
# a grammar for lists
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```

--------------------------------

### Testing llama.cpp with CANN Backend

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/build.md

This command executes the compiled `llama-cli` binary with a specified model and prompt, offloading 32 layers to the NPU. It serves to verify the successful integration and usage of the CANN backend for Ascend NPU acceleration.

```bash
./build/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32
```

--------------------------------

### Example GBNF Grammar for Name-Age JSON Array

Source: https://github.com/lizonghang/prima.cpp/blob/main/grammars/README.md

This GBNF grammar defines the structure for a JSON array containing objects with 'name' (string) and 'age' (integer) fields. It specifies rules for characters, individual items, and the overall array structure, ensuring adherence to the defined JSON schema constraints.

```gbnf
char ::= [^\"\\\x7F\x00-\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})
item ::= \"{\" space item-name-kv \",\" space item-age-kv \"}\" space
item-age ::= ([0-9] | ([1-8] [0-9] | [9] [0-9]) | \"1\" ([0-4] [0-9] | [5] \"0\")) space
item-age-kv ::= \"\\"age\\"\" space \":\" space item-age
item-name ::= \"\\"\\"\" char{1,100} \"\\"\\"\" space
item-name-kv ::= \"\\"name\\"\" space \":\" space item-name
root ::= \"\\[\" space item (\",\" space item){9,99} \"\\]\" space
space ::= | \" \" | \"\\n\" [ \\t]{0,20}
```

--------------------------------

### Starting the Local Server for Benchmarking (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/bench/README.md

This command initiates the server application, configuring it to listen on `localhost:8080` and load the `ggml-model-q4_0.gguf` model. It enables continuous batching, metrics collection, and sets various performance parameters like parallel processing, batch size, context size, and the number of GPU layers (`-ngl`). The server will handle OAI Chat completion requests.

```shell
server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  -ngl 33
```

--------------------------------

### Installing llama.cpp with Nix (Flake-enabled Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/install.md

This command installs llama.cpp using the Nix package manager for systems with flake-enabled Nix installations. It fetches the package from the nixpkgs repository.

```sh
nix profile install nixpkgs#llama-cpp
```

--------------------------------

### Installing llama.cpp with Nix (Non-flake Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/install.md

This command installs llama.cpp using the Nix package manager for systems without flake-enabled Nix installations. It uses the traditional nix-env approach.

```sh
nix-env --file '<nixpkgs>' --install --attr llama-cpp
```

--------------------------------

### Preparing LLaVA v1.6 CLIP Files for Conversion

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llava/README.md

These commands prepare the `llava.clip` and `llava.projector` files for conversion by creating a `vit` subdirectory, copying the files, renaming `llava.clip` to `pytorch_model.bin`, and downloading a `config.json` for the vision model.

```sh
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

--------------------------------

### Conditionally Installing GGML Metal Resources (CMake)

Source: https://github.com/lizonghang/prima.cpp/blob/main/ggml/CMakeLists.txt

This section handles the installation of Metal-specific resources if `GGML_METAL` is enabled. It installs the `ggml-metal.metal` source file with specific permissions and, if `GGML_METAL_EMBED_LIBRARY` is not set, also installs the compiled `default.metallib`.

```CMake
if (GGML_METAL)
    install(
        FILES src/ggml-metal.metal
        PERMISSIONS
            OWNER_READ
            OWNER_WRITE
            GROUP_READ
            WORLD_READ
        DESTINATION ${CMAKE_INSTALL_BINDIR})

    if (NOT GGML_METAL_EMBED_LIBRARY)
        install(
            FILES ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/default.metallib
            DESTINATION ${CMAKE_INSTALL_BINDIR}
        )
    endif()
endif()
```

--------------------------------

### Running SimpleChat via Llama Server (Detailed) - Shell

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/public_simplechat/readme.md

This command starts the `llama-server`, loading a GGUF model and serving the SimpleChat frontend from `examples/server/public_simplechat`. An optional `PORT` argument allows specifying the listening port, providing flexibility for deployment scenarios.

```Shell
./llama-server -m path/model.gguf --path examples/server/public_simplechat [--port PORT]
```

--------------------------------

### Installing llama.cpp with Homebrew (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/install.md

This command installs the llama.cpp library using the Homebrew package manager. It is suitable for Mac and Linux users who have Homebrew configured.

```sh
brew install llama.cpp
```

--------------------------------

### Installing llama.cpp with Flox (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/install.md

This command installs llama.cpp within a Flox environment using the Flox package manager. Flox leverages the nixpkgs build of llama.cpp.

```sh
flox install llama-cpp
```

--------------------------------

### Running LLaVA v1.5 Model with CLI

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llava/README.md

This command executes the `llama-llava-cli` with a LLaVA v1.5 7B model and its corresponding multimodal projector. It requires paths to the GGUF model files and an input image. A lower temperature (e.g., `--temp 0.1`) is recommended for better quality, and GPU offloading can be enabled with the `-ngl` flag.

```sh
./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf --image path/to/an/image.jpg
```

--------------------------------

### Installing Compiled BLIS Library

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/BLIS.md

This command installs the previously compiled BLIS library to the system-wide default location, typically /usr/local/, making it available for other applications.

```bash
sudo make install
```

--------------------------------

### Defining and Populating Llama-bench Test Table (SQL)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/llama-bench/README.md

This SQL snippet first defines the schema for a `test` table, designed to store comprehensive benchmark results from `llama-bench`, including system, model, and performance metrics. Subsequently, it provides two `INSERT` statements demonstrating how to populate this table with example benchmark data.

```sql
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cuda INTEGER,
  metal INTEGER,
  gpu_blas INTEGER,
  blas INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_threads INTEGER,
  f16_kv INTEGER,
  n_gpu_layers INTEGER,
  main_gpu INTEGER,
  mul_mat_q INTEGER,
  tensor_split TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);

INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '512', '0', '2023-09-23T12:10:30Z', '212693772', '743623', '2407.240204', '8.409634');
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '0', '128', '2023-09-23T12:10:31Z', '977925003', '4037361', '130.891159', '0.537692');
```

--------------------------------

### Installing HF to GGUF Conversion Script - CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/CMakeLists.txt

This snippet installs the `convert_hf_to_gguf.py` Python script into the installation's binary directory. It sets specific file permissions, including read, write, and execute for the owner, and read and execute for group and world. This ensures the script is executable after installation.

```CMake
install(
    FILES convert_hf_to_gguf.py
    PERMISSIONS
        OWNER_READ
        OWNER_WRITE
        OWNER_EXECUTE
        GROUP_READ
        GROUP_EXECUTE
        WORLD_READ
        WORLD_EXECUTE
    DESTINATION ${CMAKE_INSTALL_BINDIR})
```

--------------------------------

### Generated JSON Schema for Strict Zod Object

Source: https://github.com/lizonghang/prima.cpp/blob/main/grammars/README.md

This JSON snippet represents the schema generated from the Zod object defined in the previous JavaScript example. It specifies an object with `age` (positive number) and `email` (string with email format) as required properties. Crucially, `"additionalProperties": false` is set, reflecting the strictness enforced by the Zod schema and the current behavior of `zod-to-json-schema`.

```JSON
{
  "type": "object",
  "properties": {
    "age": {
      "type": "number",
      "exclusiveMinimum": 0
    },
    "email": {
      "type": "string",
      "format": "email"
    }
  },
  "required": [
    "age",
    "email"
  ],
  "additionalProperties": false,
  "$schema": "http://json-schema.org/draft-07/schema#"
}
```

--------------------------------

### Applying a Custom Theme to LLaMA.cpp Server (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/themes/README.md

This command demonstrates how to start the LLaMA.cpp server and apply a custom theme located in the 'wild' directory. The `--path` argument specifies the directory containing the theme's public assets, allowing the server to serve content from that location.

```Shell
server --path=wild
```

--------------------------------

### Configuring GGML Installation (CMake)

Source: https://github.com/lizonghang/prima.cpp/blob/main/ggml/CMakeLists.txt

This snippet includes standard CMake modules for configuring the installation process of the GGML library. It uses `GNUInstallDirs` for standard installation directory variables and `CMakePackageConfigHelpers` for generating package configuration files.

```CMake
include(GNUInstallDirs)
include(CMakePackageConfigHelpers)
```

--------------------------------

### Adding Executable and Installation Rule (CMake)

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/CMakeLists.txt

This snippet defines the `llama-server` executable using the previously collected source files and generated asset headers. It also sets up an installation rule to ensure the executable is installed as a runtime component.

```CMake
add_executable(${TARGET} ${TARGET_SRCS})
install(TARGETS ${TARGET} RUNTIME)
```

--------------------------------

### Downloading GritLM Model using HF Script

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/gritlm/README.md

This command downloads a specific GritLM model (`gritlm-7b_q4_1.gguf`) from the `cohesionet/GritLM-7B_gguf` Hugging Face repository into the `models` directory. It uses a helper script `hf.sh` to facilitate the download process.

```console
scripts/hf.sh --repo cohesionet/GritLM-7B_gguf --file gritlm-7b_q4_1.gguf --outdir models
```

--------------------------------

### Installing Llama CMake Package Files - CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/CMakeLists.txt

This snippet installs the previously configured `llama-config.cmake` and `llama-version.cmake` files. These files are placed in the `cmake/llama` subdirectory within the installation's library directory. This makes the Llama package discoverable by other CMake projects.

```CMake
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/llama-config.cmake
              ${CMAKE_CURRENT_BINARY_DIR}/llama-version.cmake
        DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/llama)
```

--------------------------------

### Installing Llama PKG-Config File - CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/CMakeLists.txt

This snippet installs the generated `llama.pc` file into the `lib/pkgconfig` directory within the installation prefix. This makes the Llama library discoverable by build systems that rely on `pkg-config`. It ensures proper integration with various build environments.

```CMake
install(FILES "${CMAKE_CURRENT_BINARY_DIR}/llama.pc"
        DESTINATION lib/pkgconfig)
```

--------------------------------

### Building and Installing llama.cpp via Android NDK Cross-Compilation (CMake/Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/android.md

These commands build the 'llama.cpp' project in release mode using the previously configured 'build-android' directory and then install the compiled binaries and libraries to a specified installation directory. The '-j{n}' flag allows parallel compilation.

```Shell
cmake --build build-android --config Release -j{n}
cmake --install build-android --prefix {install-dir} --config Release
```

--------------------------------

### Running llama-cli Inference on a Single Device (Shell)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/CANN.md

This command executes the `llama-cli` tool to perform inference on a single specified device. It loads a model, provides a prompt, sets the maximum number of tokens to generate, enables end-of-sequence token generation, specifies the number of layers to offload to the GPU, sets the split mode to 'none', and targets device ID 0.

```Shell
./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
```

--------------------------------

### Verifying GPU Driver Installation with clinfo (Linux)

Source: https://github.com/lizonghang/prima.cpp/blob/main/docs/backend/SYCL.md

This code block provides commands to install the `clinfo` utility and then use it to verify the successful installation of OpenCL drivers. Running `clinfo -l` lists all detected OpenCL platforms and devices, confirming that the GPU drivers are correctly recognized by the system.

```sh
sudo apt install clinfo
sudo clinfo -l
```

--------------------------------

### Building llama-server with Make

Source: https://github.com/lizonghang/prima.cpp/blob/main/examples/server/README.md

This command compiles the `llama-server` executable using the `make` build system from the project's root directory. It's a standard way to build the server without SSL support.

```bash
make llama-server
```

--------------------------------

### Configuring Installation Directories and Versioning in CMake

Source: https://github.com/lizonghang/prima.cpp/blob/main/CMakeLists.txt

This section configures the installation paths for headers, libraries, and binaries using standard GNUInstallDirs. It also sets project versioning information based on build number and commit, which is crucial for creating a relocatable CMake package and ensuring proper installation of project artifacts.

```CMake
include(GNUInstallDirs)
include(CMakePackageConfigHelpers)

set(LLAMA_BUILD_NUMBER        ${BUILD_NUMBER})
set(LLAMA_BUILD_COMMIT        ${BUILD_COMMIT})
set(LLAMA_INSTALL_VERSION 0.0.${BUILD_NUMBER})

set(LLAMA_INCLUDE_INSTALL_DIR ${CMAKE_INSTALL_INCLUDEDIR} CACHE PATH "Location of header  files")
set(LLAMA_LIB_INSTALL_DIR     ${CMAKE_INSTALL_LIBDIR}     CACHE PATH "Location of library files")
set(LLAMA_BIN_INSTALL_DIR     ${CMAKE_INSTALL_BINDIR}     CACHE PATH "Location of binary  files")
```