### Setup Python Virtual Environment

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/model-conversion/README.md

Create and activate a Python virtual environment, then install dependencies from requirements.txt.

```console
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

--------------------------------

### Running Llama Server with Jinja Templates

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

Examples of starting the llama server with different models and Jinja templates for generic format support.

```bash
llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
```

```bash
llama-server --jinja -fa -hf bartowski/gemma-2-2b-it-GGUF:Q8_0
```

```bash
llama-server --jinja -fa -hf bartowski/c4ai-command-r-v01-GGUF:Q2_K
```

--------------------------------

### Start Development Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/ui/README.md

Start the Vite development server for the UI frontend.

```bash
npm run dev
```

--------------------------------

### Basic CMake Project Setup

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt

Sets the minimum CMake version and defines the project name and languages. This is a standard starting point for any CMake project.

```cmake
cmake_minimum_required(VERSION 3.19)
project("vulkan-shaders-gen" C CXX)
```

--------------------------------

### Set up LunarG Vulkan SDK Environment

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/build.md

Source the setup script to configure your environment for the LunarG Vulkan SDK on macOS. This is typically done after installation.

```bash
source /path/to/vulkan-sdk/setup-env.sh
```

--------------------------------

### Install Dependencies

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/ui/README.md

Navigate to the ui directory and install project dependencies using npm.

```bash
cd tools/ui
npm install
```

--------------------------------

### LLaDA Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the LLaDA architecture with block-based scheduling and visualization.

```bash
llama-diffusion-cli -m llada-8b.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-block-length 32 --diffusion-steps 256 --diffusion-visual
```

--------------------------------

### Install and Verify clinfo

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/SYCL.md

Install the 'clinfo' utility to verify GPU driver installation and list available OpenCL devices.

```shell
sudo apt install clinfo
sudo clinfo -l
```

--------------------------------

### Using DFlash and TurboQuant with BeeLlama.cpp Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/README.md

Example of configuring the BeeLlama.cpp server to use DFlash and TurboQuant for optimized inference. This setup is for advanced performance tuning.

```sh
llama-server -m target.gguf --spec-type dflash \
  --spec-draft-model drafter.gguf \
  --spec-draft-ngl all \
  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq
```

--------------------------------

### Install k6 and xk6-sse Extension

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/README.md

Build k6 with the xk6-sse extension to support SSE. Requires Go to be installed.

```shell
go install go.k6.io/xk6/cmd/xk6@latest
$GOPATH/bin/xk6 build master \
  --with github.com/phymbert/xk6-sse
```

--------------------------------

### Install Executable

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/training/CMakeLists.txt

Installs the 'llama-finetune' target, making its runtime available.

```cmake
install(TARGETS ${TARGET} RUNTIME)
```

--------------------------------

### Run Retrieval Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/retrieval/README.md

Execute the retrieval example with specified model, context files, chunk size, and separator.

```bash
llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
```

--------------------------------

### Install SPEED-Bench Client

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/speed-bench/README.md

Install the necessary Python packages for the SPEED-Bench client. Ensure you are in the correct directory.

```bash
pip install -r tools/server/bench/speed-bench/requirements.txt
```

--------------------------------

### Model Preset Configuration Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Example .ini file demonstrating how to define global and model-specific configurations for llama-server presets.

```ini
version = 1

; (Optional) This section provides global settings shared across all presets.
; If the same key is defined in a specific preset, it will override the value in this global section.
[*]
c = 8192
n-gpu-layers = 8

; If the key corresponds to an existing model on the server,
; this will be used as the default config for that model
[ggml-org/MY-MODEL-GGUF:Q8_0]
; string value
chat-template = chatml
; numeric value
n-gpu-layers = 123
; flag value (for certain flags, you need to use the "no-" prefix for negation)
jinja = true
; shorthand argument (for example, context size)
c = 4096
; environment variable name
LLAMA_ARG_CACHE_RAM = 0
; file paths are relative to server's CWD
model-draft = ./my-models/draft.gguf
; but it's RECOMMENDED to use absolute path
model-draft = /Users/abc/my-models/draft.gguf

; If the key does NOT correspond to an existing model,
; you need to specify at least the model path or HF repo
[custom_model]
model = /Users/abc/my-awesome-model-Q4_K_M.gguf
```

--------------------------------

### Install HTP Library

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-hexagon/htp/CMakeLists.txt

Installs the built HTP library.

```cmake
install(TARGETS ${HTP_LIB})
```

--------------------------------

### Dream Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the Dream architecture with specified diffusion parameters and visualization enabled.

```bash
llama-diffusion-cli -m dream7b.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-eps 0.001 --diffusion-algorithm 3 --diffusion-steps 256 --diffusion-visual
```

--------------------------------

### Start Embedding Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/model-conversion/README.md

Starts the embedding server for model verification. Ensure the virtual environment is activated.

```console
(venv) $ make embedding-start-embedding-server
```

--------------------------------

### RND1 Architecture Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/diffusion/README.md

Example command for running the RND1 architecture with specific diffusion algorithm, sampling temperature, and epsilon.

```bash
llama-diffusion-cli -m RND1-Base-0910.gguf -p "write code to train MNIST in pytorch" -ub 512 --diffusion-algorithm 1 --diffusion-steps 256 --diffusion-visual --temp 0.5 --diffusion-eps 0.001
```

--------------------------------

### Install Build Tools on Windows

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/OPENVINO.md

Installs Git, Wget, and Ninja build tools on Windows using winget.

```powershell
# Windows PowerShell
winget install Git.Git
winget install GNU.Wget
winget install Ninja-build.Ninja
```

--------------------------------

### Build and Install llama.cpp for Android

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/android.md

After configuration, use these commands to build the project in release mode and install it to a specified directory.

```bash
cmake --build build-android --config Release -j{n}
```

```bash
cmake --install build-android --prefix {install-dir} --config Release
```

--------------------------------

### Run MobileVLM CLI Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/multimodal/MobileVLM.md

Example of how to run the MobileVLM command-line interface with a specified model, mmproj, and chat template.

```sh
./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --chat-template deepseek
```

--------------------------------

### Run Interactive Chat Example (Bash)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Command to run the interactive chat example using Bash, curl, and jq.

```sh
bash chat.sh
```

--------------------------------

### Start LLM Model Server for TTS

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/tts/README.md

Starts a llama-server instance to serve the OuteTTS LLM model on port 8020. This is part of running the TTS example with llama-server.

```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

--------------------------------

### Install NDK and OpenCL Headers/Library for Android

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/OPENCL.md

Installs the Android NDK and clones/builds OpenCL headers and ICD loader for Android development. Ensure the NDK path and version match your setup.

```shell
cd ~
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip

yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
```

```shell
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
```

--------------------------------

### Basic Usage Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/batched-bench/README.md

Run the benchmark with specified model, context size, batch sizes, and prompt/generation token counts. This example shows a custom set of batches.

```bash
./llama-batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]
```

--------------------------------

### Complete VirtGPU Configuration Example (macOS Metal)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/VirtGPU/configuration.md

This example shows how to configure VirtGPU for a macOS host using the Metal backend. It sets environment variables for the hypervisor, backend, and optional logging.

```bash
# Hypervisor environment
export VIRGL_APIR_BACKEND_LIBRARY="/opt/llama.cpp/lib/libggml-virtgpu-backend.dylib"

# Backend configuration
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib"
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg"

# Optional logging
export VIRGL_APIR_LOG_TO_FILE="/tmp/apir.log"
export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml.log"

# Guest configuration
export GGML_REMOTING_USE_APIR_CAPSET=1
```

--------------------------------

### Execute SYCL Example Script

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/SYCL.md

Run SYCL examples using a provided script. Supports selecting a single device or using multiple devices automatically.

```sh
./examples/sycl/test.sh -mg 0
```

```sh
./examples/sycl/test.sh
```

--------------------------------

### Web UI Development: Run Dev Server

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README-dev.md

Starts the development server for the Web UI, enabling hot reloading for rapid development. This command should be run after installing dependencies.

```sh
# run dev server (with hot reload)
npm run dev
```

--------------------------------

### GET /props

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Retrieves the server's global properties. By default, this endpoint is read-only. To enable modifications via POST requests, the server must be started with the `--props` flag.

````APIDOC
## GET /props: Get server global properties.

### Description
Retrieves the server's global properties. This endpoint is read-only by default. To enable modifications via POST requests, the server must be started with the `--props` flag.

### Method
GET

### Endpoint
/props

### Response
#### Success Response (200)
- **default_generation_settings** (object) - The default generation settings for the `/completion` endpoint.
- **total_slots** (integer) - The total number of slots for processing requests.
- **model_path** (string) - The path to the model file.
- **chat_template** (string) - The model's original Jinja2 prompt template.
- **chat_template_caps** (object) - Capabilities of the chat template.
- **modalities** (object) - The list of supported modalities.
- **media_marker** (string) - A media marker string.
- **build_info** (string) - Build information of the server.
- **is_sleeping** (boolean) - Sleeping status of the server.

#### Response Example
```json
{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "...",
  "chat_template_caps": {},
  "modalities": {
    "vision": false
  },
  "media_marker": "<__media_YoNhud46VdDqbuFmKYEO9PY7A4ARzRfg__>",
  "build_info": "b(build number)-(build commit hash)",
  "is_sleeping": false
}
```
````

--------------------------------

### Prepare Visual Encoder Directory

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/multimodal/granitevision.md

Create a directory for visual components and copy the llava.clip and llava.projector files into it.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH
$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

--------------------------------

### Start llama-server for Benchmarking

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/bench/speed-bench/README.md

Launch the llama-server with specific configurations for benchmarking. Match the client's concurrency to the server's slot count for accurate throughput measurements.

```bash
llama-server \
  -m target.gguf \
  -c 8192 \
  --port 8080 \
  -ngl 99 -fa on \
  --np 1 \
  --jinja
```

--------------------------------

### Start Server with Hermes Template Override

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

When running Hermes models, use the `--chat-template-file` argument to specify the correct Jinja template for tool interaction, ensuring proper function calling setup.

```shell
llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
```

```shell
llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
```

--------------------------------

### Docker Compose with Environment Variables

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Example of configuring the llama.cpp server using Docker Compose and environment variables. This setup specifies the model path, context size, parallel processing, metrics endpoint, and server port.

```yaml
services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
```

--------------------------------

### Setup Build Context for Debugging

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/development/debugging-tests.md

Prepare a clean build directory for debugging. This involves removing any existing build directory and creating a new one.

```bash
rm -rf build-ci-debug && mkdir build-ci-debug && cd build-ci-debug
```

--------------------------------

### CMakeLists.txt for GGUF Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/examples/gguf/CMakeLists.txt

This CMakeLists.txt file configures the build for a C++ executable named 'llama-gguf'. It specifies the source file, installation target, and links against the 'ggml' library and threading support. It also sets the C++ standard to C++17.

```cmake
set(TARGET llama-gguf)
add_executable(${TARGET} gguf.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE ggml ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

--------------------------------

### Example Slot Status with 2 Slots

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Shows the current processing state for two available slots, including task ID, context size, processing status, and sampling parameters. This response is returned by the GET /slots endpoint.

```json
[{"id":0,"id_task":135,"n_ctx":65536,"speculative":false,"is_processing":true,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"chat_format":"GPT-OSS","reasoning_format":"none","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0}},{"id":1,"id_task":0,"n_ctx":65536,"speculative":false,"is_processing":true,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"chat_format":"GPT-OSS","reasoning_format":"none","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"next_token":{"has_next_token":true,"has_new_line":true,"n_remain":-1,"n_decoded":136}}]
```

--------------------------------

### Vulkan Shader Generation Setup

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-vulkan/CMakeLists.txt

Sets up variables and finds necessary executables for Vulkan shader generation. This includes defining output directories, shader input paths, and the command for the shader generator tool.

```cmake
set (_ggml_vk_host_suffix $<IF:$<STREQUAL:${CMAKE_HOST_SYSTEM_NAME},Windows>,.exe,>)
set (_ggml_vk_genshaders_dir "${CMAKE_BINARY_DIR}/$<CONFIG>")
set (_ggml_vk_genshaders_cmd "${_ggml_vk_genshaders_dir}/vulkan-shaders-gen${_ggml_vk_host_suffix}")
set (_ggml_vk_header "${CMAKE_CURRENT_BINARY_DIR}/ggml-vulkan-shaders.hpp")
set (_ggml_vk_input_dir "${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders")
set (_ggml_vk_output_dir "${CMAKE_CURRENT_BINARY_DIR}/vulkan-shaders.spv")

file(GLOB _ggml_vk_shader_files CONFIGURE_DEPENDS "${_ggml_vk_input_dir}/*.comp")

# Because external projects do not provide source-level tracking,
# the vulkan-shaders-gen sources need to be explicitly added to
# ensure that changes will cascade into shader re-generation.
file(GLOB _ggml_vk_shaders_gen_sources CONFIGURE_DEPENDS
     "${_ggml_vk_input_dir}/*.cpp"
     "${_ggml_vk_input_dir}/*.h")
```

--------------------------------

### Run Completion with Multiple Hexagon NPU Sessions

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/snapdragon/README.md

Execute a summary request using the OLMoE-1B-7B model, requiring two Hexagon NPU sessions (HTP0, HTP1) due to its size. This example demonstrates multi-session setup and provides performance and memory breakdown.

```bash
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 143.86 MiB
load_tensors: HTP1 model buffer size = 0.23 MiB
load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB
load_tensors: HTP0 model buffer size = 0.28 MiB
load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB
...
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache: HTP1 KV buffer size = 238.00 MiB
llama_kv_cache: HTP0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
llama_context: HTP0 compute buffer size = 15.00 MiB
llama_context: HTP1 compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 24.56 MiB
...
llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second)
llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second)
llama_perf_context_print: total time = 7377.33 ms / 469 tokens
llama_perf_context_print: graphs reused = 255
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 |
llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 |
llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 |
```

--------------------------------

### Install Hexagon Skels

Source: https://github.com/anbeeld/beellama.cpp/blob/main/ggml/src/ggml-hexagon/CMakeLists.txt

Installs Hexagon skels required at runtime.
This is a basic installation command.

```cmake
install(FILES ${HTP_SKELS} TYPE LIB)
```

--------------------------------

### Install and Set Up Environment on Target Device

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/snapdragon/linux.md

Transfer the zipped package to the target Linux device, unzip it, and set the LD_LIBRARY_PATH and ADSP_LIBRARY_PATH environment variables to include the libraries.

```bash
$ unzip pkg-snapdragon.zip
$ cd pkg-snapdragon
$ export LD_LIBRARY_PATH=./lib
$ export ADSP_LIBRARY_PATH=./lib
```

--------------------------------

### OpenAI-Compatible Server Commands for BeeLlama.cpp

Source: https://github.com/anbeeld/beellama.cpp/blob/main/README.md

Shows how to start the OpenAI-compatible server for BeeLlama.cpp. Use these commands to expose models via an API.

```sh
llama-server -m model.gguf --port 8080
```

```sh
llama-server -m model.gguf -c 16384 -np 4
```

```sh
llama-server -m model.gguf -md draft.gguf
```

--------------------------------

### Install gguf Python Package

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

Install the gguf package using pip. This is the basic installation for using the package's functionalities.

```sh
pip install gguf
```

--------------------------------

### Web UI Development: Install Dependencies

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README-dev.md

Installs the necessary Node.js dependencies for the Web UI development server. Ensure Node.js is installed before running.

```sh
# make sure you have Node.js installed
cd tools/ui
npm i
```

--------------------------------

### Quick Start llama-server on Windows

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/server/README.md

Start the llama-server on Windows. Use the .exe executable and specify the correct path for the model file and context size.

```powershell
llama-server.exe -m models\7B\ggml-model.gguf -c 2048
```

--------------------------------

### Start Server with Native Tool-Aware Jinja Support

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/function-calling.md

Use this command to start a server with native support for tool-aware Jinja templates. Ensure the model specified has a compatible chat template.

```shell
llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
```

```shell
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
```

```shell
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
```

```shell
llama-server --jinja -fa -hf bartowski/granite-4.1-3b-GGUF:Q4_K_M
```

--------------------------------

### Verify CUDA Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/backend/CUDA-FEDORA.md

Checks the installed version of the NVIDIA CUDA Compiler (nvcc). This command confirms that CUDA is correctly installed and accessible in the PATH.

```bash
nvcc --version
```

--------------------------------

### Upgrade Pip for Editable Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

If an editable installation requires an upgrade to Pip, use this command to install the latest version. This ensures compatibility with modern packaging standards.

```sh
pip install --upgrade pip
```

--------------------------------

### Example BeeLlama DFlash Launch Script

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/beellama-args.md

This example demonstrates a typical DFlash launch script configuration for BeeLlama, including model paths, speculative decoding settings, context sizes, KV cache precision, and logging options. It serves as a template for setting up a high-performance BeeLlama server.

```shell
llama-server \
  -m "path/to/target.gguf" \
  --mmproj "path/to/mmproj.gguf" \
  --no-mmproj-offload \
  --spec-draft-model "path/to/drafter.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 8082 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 102400 \
  --cache-type-k q5_0 --cache-type-v q4_0 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host --metrics \
  --log-timestamps --log-prefix --log-colors off \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 --top-k 20 --min-p 0.0
```

--------------------------------

### Install OpenCL Headers for Windows Arm64

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/build.md

Installs OpenCL headers and the ICD loader for Windows Arm64. CMake is used with the Ninja generator, and both projects are installed under the same prefix.

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

--------------------------------

### Custom Batch Configuration Example

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/batched-bench/README.md

Run the benchmark with a specific set of prompt token counts (`-npp`), generated token counts (`-ntg`), and parallel sequence counts (`-npl`).

```bash
./llama-batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 2048 -b 512 -ub 512 -ngl 999 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
```

--------------------------------

### Install llama.cpp via Nix (Non-flake-enabled)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/install.md

Use this command to install llama.cpp on Mac and Linux for non-flake-enabled Nix installations. This expression is automatically updated within the nixpkgs repository.

```sh
nix-env --file '<nixpkgs>' --install --attr llama-cpp
```

--------------------------------

### Install llama.cpp via Nix (Flake-enabled)

Source: https://github.com/anbeeld/beellama.cpp/blob/main/docs/install.md

Use this command to install llama.cpp on Mac and Linux for flake-enabled Nix installations. This expression is automatically updated within the nixpkgs repository.

```sh
nix profile install nixpkgs#llama-cpp
```

--------------------------------

### Install Build and Twine for Manual Publishing

Source: https://github.com/anbeeld/beellama.cpp/blob/main/gguf-py/README.md

Install the 'build' and 'twine' packages, which are necessary for manually building and uploading the Python package to PyPI.

```sh
pip install build twine
```

--------------------------------

### Apply Multiple Scaled LoRA Adapters

Source: https://github.com/anbeeld/beellama.cpp/blob/main/tools/export-lora/README.md

Example demonstrating how to apply multiple LoRA adapters with custom scaling factors to a base model.

```bash
./bin/llama-export-lora \
    -m your_base_model.gguf \
    -o your_merged_model.gguf \
    --lora-scaled lora_task_A.gguf 0.5 \
    --lora-scaled lora_task_B.gguf 0.5
```

--------------------------------

### Conditional Installation

Source: https://github.com/anbeeld/beellama.cpp/blob/main/app/CMakeLists.txt

Installs the 'llama-app' target if the LLAMA_TOOLS_INSTALL build option is enabled.

```cmake
if(LLAMA_TOOLS_INSTALL)
    install(TARGETS ${TARGET} RUNTIME)
endif()
```
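
--------------------------------

### Summarizing a GET /slots Response (Illustrative Sketch)

The two-slot `GET /slots` response shown earlier is dense, and the benchmarking note recommends matching client concurrency to the server's slot count. The sketch below is one possible way to reduce that response to a few fields per slot. It is not part of the project: the helper name `summarize_slots` and the trimmed `sample` payload are hypothetical, and only keys that appear in the example response (`id`, `is_processing`, `n_ctx`, `next_token.n_decoded`) are read.

```python
import json


def summarize_slots(slots):
    """Reduce a parsed GET /slots response (a list of slot objects) to key fields.

    Hypothetical helper for illustration; it only reads keys that appear in
    the example response in this document.
    """
    summary = []
    for slot in slots:
        next_token = slot.get("next_token", {})
        summary.append(
            {
                "id": slot["id"],
                "busy": bool(slot.get("is_processing", False)),
                "n_ctx": slot.get("n_ctx"),
                "n_decoded": next_token.get("n_decoded", 0),
            }
        )
    return summary


# Trimmed stand-in for the two-slot response (most sampling params omitted).
sample = json.loads(
    """
    [
      {"id": 0, "id_task": 135, "n_ctx": 65536, "is_processing": true,
       "next_token": {"has_next_token": true, "n_decoded": 0}},
      {"id": 1, "id_task": 0, "n_ctx": 65536, "is_processing": true,
       "next_token": {"has_next_token": true, "n_decoded": 136}}
    ]
    """
)

for row in summarize_slots(sample):
    print(row)
```

Against a running server, the same function could be fed the parsed body of an HTTP request to `/slots` (for instance via `urllib.request`), assuming the endpoint is reachable on the port the server was started with. The length of the returned list is then the slot count to match on the client side.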