### Setup Python Environment for TTS Client

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This sequence sets up a Python virtual environment and installs the libraries (`requests`, `numpy`) required to run the `tts-outetts.py` client script that communicates with the `llama-server` instances.

```bash
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy
```

--------------------------------

### Quickstart TTS Generation with llama.cpp

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This command demonstrates the quickest way to generate speech from text using the llama.cpp TTS example. It assumes llama.cpp is built with `-DLLAMA_CURL=ON` so that the required models are downloaded automatically. The output audio is saved as `output.wav`.

```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```

--------------------------------

### Start llama-server for LLM Model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This command starts an instance of `llama-server` to serve the LLM model for TTS. It specifies the model file and the port for the server to listen on.

```bash
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

--------------------------------

### Start llama-server for Voice Decoder Model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This command starts another instance of `llama-server` to serve the voice decoder model. It specifies the model file, port, and additional parameters like `--embeddings` and `--pooling none`.
```bash
./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
```

--------------------------------

### Install and Verify CLinfo (Shell)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

Installs the `clinfo` utility and then runs it to list available OpenCL platforms and devices. This helps verify GPU driver installation.

```shell
sudo apt install clinfo
sudo clinfo -l
```

--------------------------------

### Setup OpenCL Environment for Windows

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/OPENCL.md

Installs OpenCL headers and the ICD loader from source on Windows. These dependencies are required for building llama.cpp with OpenCL support.

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

--------------------------------

### Install Ascend Driver

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CANN.md

Commands to create a user group and user for the Ascend driver, and then install the Ascend NPU driver. After installation, `npu-smi info` can be used to verify the driver installation.

```shell
# create driver running user.
# (groupadd's -g option expects a numeric GID, so the group is created by name only)
sudo groupadd HwHiAiUser
sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
sudo usermod -aG HwHiAiUser $USER

# download driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
# and install driver.
sudo sh Ascend-hdk-910b-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all
```

--------------------------------

### Install Ascend Firmware

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CANN.md

Command to install the Ascend NPU firmware. Download the firmware package from the official Ascend website and run the installer script. A success message indicates proper installation.

```shell
# download firmware from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
# and install firmware.
sudo sh Ascend-hdk-910b-npu-firmware_x.x.x.x.X.run --full
```

--------------------------------

### Interactive Mode (Bash)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Runs an interactive chat example using a bash script. This method requires bash, cURL, and jq to be installed. It is suitable for command-line driven interactive applications.

```shell
bash chat.sh
```

--------------------------------

### Install CANN Toolkit and Kernels

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CANN.md

Installs the CANN toolkit and kernels using pip and the provided run scripts. It also sets up the necessary environment variables by sourcing the `set_env.sh` script in `.bashrc`.
```shell
pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
sh Ascend-cann-toolkit_8.0.RC2.alpha002_linux-aarch64.run --install
sh Ascend-cann-kernels-910b_8.0.RC2.alpha002_linux.run --install
echo "source ~/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc
source ~/.bashrc
```

--------------------------------

### Run TTS Example with Local Models

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This command executes the llama.cpp TTS example using locally converted and quantized models. It specifies the paths to the LLM model and the voice decoder model, along with the input prompt.

```bash
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
```

--------------------------------

### Start Llama Server with Native Support

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/function-calling.md

Starts `llama-server` with Jinja templating (`--jinja`) and flash attention (`-fa`) enabled, pulling the specified models from Hugging Face (`-hf`). This is for models whose tool-call format is supported natively.

```shell
llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
```

```shell
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
```

```shell
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
```

--------------------------------

### Start Llama Server with Hermes Template Override

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/function-calling.md

Starts the llama-server with Jinja support for Hermes models. This command requires a specific chat template file to be provided.
```shell
llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
```

```shell
llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
```

--------------------------------

### Run GritLM Example with a downloaded model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/examples/gritlm/README.md

This command executes the llama-gritlm example program using a downloaded GritLM model. It requires the `llama-gritlm` executable to be present and the model file to be at the specified path.

```bash
$ ./llama-gritlm -m models/gritlm-7b_q4_1.gguf
```

--------------------------------

### Install Development Tools and Dependencies

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CUDA-FEDORA.md

Synchronizes the DNF package manager and installs essential development headers, compilers, and build tools required for software compilation.

```bash
sudo dnf distro-sync
sudo dnf install vim-default-editor --allowerasing
sudo dnf install @c-development @development-tools cmake
```

--------------------------------

### Verify Vulkan Installation

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/build.md

Runs the vulkaninfo utility to ensure the Vulkan SDK is correctly installed and accessible by the system.

```sh
vulkaninfo
```

--------------------------------

### Build and Install llama.cpp

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/examples/simple-cmake-pkg/README.md

Commands to clone the llama.cpp repository, build the project from source, and install the resulting binaries and headers into a local directory for external consumption.

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build
cmake --build build
cmake --install build --prefix inst
```

--------------------------------

### Start Llama Server with Functionary Template Override

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/function-calling.md

Starts the llama-server with Jinja support for the functionary model. A specific chat template file is required for optimal performance.

```shell
llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M --chat-template-file models/templates/meetkai-functionary-medium-v3.2.jinja
```

--------------------------------

### Start Llama Server with Cohere Command-R Template Override

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/function-calling.md

Starts the llama-server with Jinja support for the Cohere Command-R model. A dedicated chat template file is necessary for tool use functionality.

```shell
llama-server --jinja -fa -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L \
  --chat-template-file models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja
```

--------------------------------

### C++ Preprocessor Directive Example

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/CONTRIBUTING.md

Provides a basic example of a preprocessor directive in C++ using `#ifdef` and `#endif`. This is a placeholder for more detailed guidelines.

```cpp
#ifdef FOO
#endif // FOO
```

--------------------------------

### Testing with CURL

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Example of how to test the `/completion` endpoint using cURL.

````APIDOC
## Testing with CURL

Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.

```sh
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```
````

--------------------------------

### Run Batched Generation with llama.cpp (Bash)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/examples/batched/README.md

This command executes the `llama-batched` example, specifying the model path, a prompt, and the number of parallel sequences to generate. It outputs the generated text for each sequence and detailed performance timings, including token evaluation and generation speeds.

```bash
./llama-batched -m ./models/llama-7b-v2/ggml-model-f16.gguf -p "Hello my name is" -np 4

...

main: n_len = 32, n_ctx = 2048, n_parallel = 4, n_kv_req = 113

 Hello my name is

main: generating 4 sequences ...

main: stream 0 finished
main: stream 1 finished
main: stream 2 finished
main: stream 3 finished

sequence 0:

Hello my name is Shirley. I am a 25-year-old female who has been working for over 5 years as a b

sequence 1:

Hello my name is Renee and I'm a 32 year old female from the United States. I'm looking for a man between

sequence 2:

Hello my name is Diana. I am looking for a housekeeping job. I have experience with children and have my own transportation. I am

sequence 3:

Hello my name is Cody. I am a 3 year old neutered male. I am a very friendly cat. I am very playful and

main: decoded 108 tokens in 3.57 s, speed: 30.26 t/s

llama_print_timings:
        load time =   587.00 ms
      sample time =     2.56 ms / 112 runs ( 0.02 ms per token, 43664.72 tokens per second)
 prompt eval time =  4089.11 ms / 118 tokens ( 34.65 ms per token, 28.86 tokens per second)
        eval time =     0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
       total time =  4156.04 ms
```

--------------------------------

### Set up and Check SYCL Environment

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

Steps to prepare the environment for SYCL execution, including sourcing the oneAPI environment script and listing available SYCL devices. This helps in verifying the SYCL setup and identifying usable hardware.

```sh
# Enable oneAPI running environment
source /opt/intel/oneapi/setvars.sh

# List devices information
./build/bin/llama-ls-sycl-device
```

--------------------------------

### Docker Setup

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Instructions for running the llama.cpp server using Docker, with and without CUDA support.

````APIDOC
## Docker

### Running the server with CPU

```bash
docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
```

### Running the server with CUDA

```bash
docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```
````

--------------------------------

### Install LLaVA Python Dependencies

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/multimodal/llava.md

Command to install the necessary Python packages for LLaVA model conversion and processing, using pip and a requirements file.
```shell
pip install -r tools/mtmd/requirements.txt
```

--------------------------------

### Launch Server with Legacy UI

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

To revert to the legacy completion-based web UI, start the llama-server binary with the `--path` flag pointing to the legacy public directory.

```bash
./llama-server -m my_model.gguf -c 8192 --path ./tools/server/public_legacy
```

--------------------------------

### Manage Web UI development environment

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Commands to install dependencies, run the development server, and build the production-ready Web UI assets.

```bash
cd tools/server/webui
npm i
npm run dev
npm run build
```

--------------------------------

### Initialize oneAPI Environment and List SYCL Devices (Shell)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

Sources the oneAPI environment script to set up necessary environment variables, then uses `sycl-ls` to list available SYCL devices on the system. This is crucial for verifying the SYCL setup for different GPU types.

```shell
source /opt/intel/oneapi/setvars.sh
sycl-ls
```

--------------------------------

### Run Inference on Android

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/multimodal/MobileVLM.md

Example command for running inference on an Android device using the llama-mtmd-cli binary with image input.

```sh
/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"
```

--------------------------------

### Running llama-server with Generic Format Support

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/function-calling.md

Examples of launching the llama-server with GGUF models and Jinja templates for generic format support. These commands specify the model repository and quantization level.

```bash
llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
llama-server --jinja -fa -hf bartowski/gemma-2-2b-it-GGUF:Q8_0
llama-server --jinja -fa -hf bartowski/c4ai-command-r-v01-GGUF:Q2_K
```

--------------------------------

### Define GBNF Root Rule for Lists

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/grammars/README.md

A basic example of a root rule that forces the model to output a list format.

```GBNF
# a grammar for lists
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```

--------------------------------

### Configure llama-server via Docker Compose

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Example of deploying the llama-server using Docker Compose, demonstrating how to map volumes and set configuration parameters using environment variables.

```yaml
services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
```

--------------------------------

### Define GBNF Chess Notation Grammar

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/grammars/README.md

An example of a GBNF grammar file defining chess moves. It demonstrates the use of the root rule, sequence concatenation, and alternative production rules.

```GBNF
root ::= (
    "1. " move " " move "\n"
    ([1-9] [0-9]? ". " move " " move "\n")+
)
move ::= (pawn | nonpawn | castle) [+#]?
pawn ::= ...
nonpawn ::= ...
castle ::= ...
```

--------------------------------

### Run LLaVA CLI with LLaVA 1.6 Model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/multimodal/llava.md

Example command to run the `llama-mtmd-cli` binary with a LLaVA 1.6 model. It specifies the GGUF model and the multimodal projector. Note that LLaVA 1.6 requires more context (e.g., `-c 4096`).

```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

--------------------------------

### Run llama-simple-cmake-pkg

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/examples/simple-cmake-pkg/README.md

Example command to execute the built binary, providing a path to a GGUF model file and a prompt string.

```shell
./build/llama-simple-cmake-pkg -m ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is"
```

--------------------------------

### Download and Convert OuteTTS Model to GGUF

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/README.md

This sequence of commands shows how to download the OuteTTS LLM model from Hugging Face and convert it to the GGUF format required by llama.cpp. It includes cloning the model repository, installing Git LFS, and using the `convert_hf_to_gguf.py` script.

```bash
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```

--------------------------------

### Start llama-server and run benchmark

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/bench/README.md

Launches the llama-server with specific performance flags and executes the k6 load testing script to measure throughput and latency.
```shell
llama-server --host localhost --port 8080 --model ggml-model-q4_0.gguf --cont-batching --metrics --parallel 8 --batch-size 512 --ctx-size 4096 -ngl 33
./k6 run script.js --duration 10m --iterations 500 --vus 8
```

--------------------------------

### CMake Build Configuration for Llama C++ Executable

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/cvector-generator/CMakeLists.txt

Configures the CMake build system to create an executable named `llama-cvector-generator`. It specifies the source files, installation rules, linked libraries (common, llama, and threading support), and the required C++ standard (C++17).

```cmake
set(TARGET llama-cvector-generator)
add_executable(${TARGET} cvector-generator.cpp pca.hpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

--------------------------------

### Convert Zod Object to JSON Schema (JavaScript)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/grammars/README.md

Converts a strict Zod object schema to a JSON schema using `zod-to-json-schema`. This example defines a schema for an object with `age` and `email` properties and demonstrates the conversion process. It requires the `zod` and `zod-to-json-schema` libraries.

```javascript
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const Foo = z.object({
  age: z.number().positive(),
  email: z.string().email(),
}).strict();

console.log(zodToJsonSchema(Foo));
```

--------------------------------

### Configure llama-tts Executable Build with CMake

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/tts/CMakeLists.txt

This snippet configures the build process for the `llama-tts` executable using CMake. It specifies the source file `tts.cpp`, links against `llama`, `common`, and threading libraries, and sets the C++ standard to C++17.
The `install` command registers the executable as a runtime target, so `cmake --install` copies it to the runtime (binary) destination.

```cmake
set(TARGET llama-tts)
add_executable(${TARGET} tts.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

--------------------------------

### Configure Executable Target and Link Libraries (CMake)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/examples/speculative-simple/CMakeLists.txt

This CMake snippet defines an executable target named `llama-speculative-simple`, specifies its source file, links the required libraries (common, llama, and threading support), and sets the C++ standard to C++17. It also includes an install rule for the target.

```cmake
set(TARGET llama-speculative-simple)
add_executable(${TARGET} speculative-simple.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

--------------------------------

### Create and Enter Fedora Toolbox

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CUDA-FEDORA.md

Commands to initialize a new Fedora 41 toolbox container and enter the environment to begin installation with root privileges.

```bash
toolbox create --image registry.fedoraproject.org/fedora-toolbox:41 --container fedora-toolbox-41-cuda
toolbox enter --container fedora-toolbox-41-cuda
```

--------------------------------

### Run LLaVA CLI with LLaVA 1.5 Model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/multimodal/llava.md

Example command to run the `llama-mtmd-cli` binary with a LLaVA 1.5 model. It specifies the model, multimodal projector, and chat template. Lower temperatures (e.g., 0.1) are recommended for better quality, and GPU offloading can be enabled with the `-ngl` flag.
```shell
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

--------------------------------

### Benchmark llama-cli with specific thread and GPU configurations

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/development/token_generation_performance_tips.md

Example command used for benchmarking inference speed by adjusting thread counts (`-t`) and GPU layer offloading (`-ngl`). This helps determine the optimal balance between CPU utilization and GPU acceleration for a specific hardware setup.

```shell
./llama-cli -m "path/to/model.gguf" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 -t 4 -ngl 2000000
```

--------------------------------

### Start llama-server

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Commands to launch the server on Unix-based systems and Windows, specifying the model path and context size.

```bash
./llama-server -m models/7B/ggml-model.gguf -c 2048
```

```powershell
llama-server.exe -m models\7B\ggml-model.gguf -c 2048
```

--------------------------------

### JSON Schema for Pydantic Summary Model

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/grammars/README.md

This JSON object represents the schema generated for the Pydantic `Summary` model. It details the structure, types, and constraints for `key_facts` (an array of strings, each starting with '- ' followed by at least 5 characters) and `question_answers` (an array whose items are themselves arrays of at least 5 `QAPair` objects). The `QAPair` schema is defined in the `$defs` section.
```json
{
  "$defs": {
    "QAPair": {
      "additionalProperties": true,
      "properties": {
        "question": { "title": "Question", "type": "string" },
        "concise_answer": { "title": "Concise Answer", "type": "string" },
        "justification": { "title": "Justification", "type": "string" }
      },
      "required": [ "question", "concise_answer", "justification" ],
      "title": "QAPair",
      "type": "object"
    }
  },
  "additionalProperties": true,
  "properties": {
    "key_facts": {
      "items": { "pattern": "^- .{5,}$", "type": "string" },
      "title": "Key Facts",
      "type": "array"
    },
    "question_answers": {
      "items": {
        "items": { "$ref": "#/$defs/QAPair" },
        "minItems": 5,
        "type": "array"
      },
      "title": "Question Answers",
      "type": "array"
    }
  },
  "required": [ "key_facts", "question_answers" ],
  "title": "Summary",
  "type": "object"
}
```

--------------------------------

### POST /completion Endpoint - With Sampling Parameters

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

This snippet illustrates calling the `/completion` endpoint with sampling parameters such as `temperature`, `top_k`, and `top_p` to control the text generation process. These parameters tune the trade-off between output variety and coherence.

```http
POST /completion
Content-Type: application/json

{
  "prompt": "Write a short story about a robot.",
  "temperature": 0.8,
  "top_k": 40,
  "top_p": 0.95,
  "n_predict": 256
}
```

--------------------------------

### Build LLAMA with SYCL Backend (Intel GPUs)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

Instructions for building the LLAMA project with the SYCL backend, targeting Intel GPUs. It includes options for FP32 (recommended) and FP16 precision, and how to build the binaries. Precision issues can be addressed by setting SYCL_PROGRAM_COMPILE_OPTIONS.
```sh
# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Option 2: Use FP16
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON

# build all binary
cmake --build build --config Release -j -v
```

--------------------------------

### Install GGUF package via pip

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/gguf-py/README.md

Commands to install the GGUF package. The standard installation provides core functionality, while the optional `gui` extra enables the visual GGUF editor.

```shell
pip install gguf
pip install gguf[gui]
```

--------------------------------

### Library Build and Installation (CMake)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/ggml/CMakeLists.txt

Defines the build process for the ggml library, including adding source directories and installing public headers and libraries. It also handles conditional installation of pkgconfig files.
```cmake
#
# build the library
#

add_subdirectory(src)

#
# tests and examples
#

if (GGML_BUILD_TESTS)
    enable_testing()
    add_subdirectory(tests)
endif ()

if (GGML_BUILD_EXAMPLES)
    add_subdirectory(examples)
endif ()

#
# install
#

include(CMakePackageConfigHelpers)

# all public headers
set(GGML_PUBLIC_HEADERS
    include/ggml.h
    include/ggml-cpu.h
    include/ggml-alloc.h
    include/ggml-backend.h
    include/ggml-blas.h
    include/ggml-cann.h
    include/ggml-cpp.h
    include/ggml-cuda.h
    include/ggml-opt.h
    include/ggml-metal.h
    include/ggml-rpc.h
    include/ggml-sycl.h
    include/ggml-vulkan.h
    include/ggml-webgpu.h
    include/gguf.h)

set_target_properties(ggml PROPERTIES PUBLIC_HEADER "${GGML_PUBLIC_HEADERS}")
#if (GGML_METAL)
#    set_target_properties(ggml PROPERTIES RESOURCE "${CMAKE_CURRENT_SOURCE_DIR}/src/ggml-metal.metal")
#endif()
install(TARGETS ggml LIBRARY PUBLIC_HEADER)
install(TARGETS ggml-base LIBRARY)

if (GGML_STANDALONE)
    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/ggml.pc.in
        ${CMAKE_CURRENT_BINARY_DIR}/ggml.pc
        @ONLY)

    install(FILES ${CMAKE_CURRENT_BINARY_DIR}/ggml.pc
        DESTINATION share/pkgconfig)
endif()
```

--------------------------------

### Build llama-server with CMake

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Standard build commands for compiling the llama-server binary from the project root using CMake.

```bash
cmake -B build
cmake --build build --config Release -t llama-server
```

--------------------------------

### Build llama.cpp using CMake Presets

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

This section demonstrates building llama.cpp using CMake presets for different configurations on Windows. It includes commands for release and debug builds with SYCL support, each targeting the `llama-cli` executable. The second pair of commands shows how to enable FP16 on top of the release preset.
```sh
cmake --preset x64-windows-sycl-release
cmake --build build-x64-windows-sycl-release -j --target llama-cli

cmake -DGGML_SYCL_F16=ON --preset x64-windows-sycl-release
cmake --build build-x64-windows-sycl-release -j --target llama-cli

cmake --preset x64-windows-sycl-debug
cmake --build build-x64-windows-sycl-debug -j --target llama-cli
```

--------------------------------

### Verify CUDA Installation (Bash)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/CUDA-FEDORA.md

Checks whether the CUDA installation succeeded by querying the NVIDIA CUDA Compiler (nvcc) version. This command should output details about the installed CUDA toolkit, confirming its accessibility.

```bash
nvcc --version
```

--------------------------------

### Build llama-server with SSL support

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/tools/server/README.md

Instructions for compiling the llama-server binary with OpenSSL 3 support enabled via CMake flags.

```bash
cmake -B build -DLLAMA_SERVER_SSL=ON
cmake --build build --config Release -t llama-server
```

--------------------------------

### GET /props

Source: https://context7.com/liquid4all/liquid_llama.cpp/llms.txt

Get server properties including loaded model information and chat template.

````APIDOC
## GET /props

### Description
Get server properties including loaded model information and chat template.

### Method
GET

### Endpoint
/props

### Response
#### Success Response (200)
- **model_info** (object) - Information about the loaded model.
- **context_size** (integer) - The context size of the model.
- **chat_template** (string) - The chat template used by the model.

#### Response Example
```json
{
  "model_info": {
    "name": "llama-2-7b-chat.gguf",
    "quantization": "Q4_K_M",
    "file_type": "GGUF"
  },
  "context_size": 4096,
  "chat_template": "{{ bos_token }}{% for message in messages %}{% if message.role == 'user' %}{{ 'User: ' + message.content + '\n' }}{% elif message.role == 'assistant' %}{{ 'Assistant: ' + message.content + '\n' }}{% else %}{{ message.content + '\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
}
```
````

--------------------------------

### List Available SYCL Devices (Windows)

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/docs/backend/SYCL.md

Lists the available SYCL devices recognized by the system on Windows. This command helps verify the installation of oneAPI and the detection of SYCL-compatible GPUs, such as Intel Level-Zero devices.

```shell
sycl-ls.exe
```

--------------------------------

### Running llama.cpp CLI Examples

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/README.md

Demonstrates how to use the `llama-cli` tool for local model inference and for downloading models directly from Hugging Face. It requires a local model file or a model identifier from Hugging Face.

```shell
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```

--------------------------------

### Configure CMake Package and Installation Paths

Source: https://github.com/liquid4all/liquid_llama.cpp/blob/master/ggml/CMakeLists.txt

This snippet sets up CMake variables for package versioning and installation directories. It configures the package configuration file and writes a version file, ensuring proper installation of CMake modules and version information.
```cmake
set(GGML_INSTALL_VERSION 0.0.${GGML_BUILD_NUMBER})
set(GGML_INCLUDE_INSTALL_DIR ${CMAKE_INSTALL_INCLUDEDIR} CACHE PATH "Location of header files")
set(GGML_LIB_INSTALL_DIR ${CMAKE_INSTALL_LIBDIR} CACHE PATH "Location of library files")
set(GGML_BIN_INSTALL_DIR ${CMAKE_INSTALL_BINDIR} CACHE PATH "Location of binary files")

configure_package_config_file(
    ${CMAKE_CURRENT_SOURCE_DIR}/cmake/ggml-config.cmake.in
    ${CMAKE_CURRENT_BINARY_DIR}/ggml-config.cmake
    INSTALL_DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/ggml
    PATH_VARS GGML_INCLUDE_INSTALL_DIR
              GGML_LIB_INSTALL_DIR
              GGML_BIN_INSTALL_DIR)

write_basic_package_version_file(
    ${CMAKE_CURRENT_BINARY_DIR}/ggml-version.cmake
    VERSION ${GGML_INSTALL_VERSION}
    COMPATIBILITY SameMajorVersion)

target_compile_definitions(ggml-base PRIVATE
    GGML_VERSION="${GGML_INSTALL_VERSION}"
    GGML_COMMIT="${GGML_BUILD_COMMIT}"
)
message(STATUS "ggml version: ${GGML_INSTALL_VERSION}")
message(STATUS "ggml commit:  ${GGML_BUILD_COMMIT}")

install(FILES ${CMAKE_CURRENT_BINARY_DIR}/ggml-config.cmake
              ${CMAKE_CURRENT_BINARY_DIR}/ggml-version.cmake
        DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/ggml)
```
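
The `COMPATIBILITY SameMajorVersion` setting above determines which requested versions a consumer's `find_package(ggml <version>)` call will accept. As a rough illustration only, the Python sketch below mimics that rule for simple (non-range) version requests: the installed version must share the major component with the requested one and must not be older. The helper names `parse_version` and `same_major_compatible` are hypothetical; CMake performs the real check inside the generated `ggml-version.cmake`.

```python
def parse_version(text):
    """Split a dotted version such as '0.0.1234' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def same_major_compatible(installed, requested):
    """Approximate CMake's SameMajorVersion rule (hypothetical helper).

    The installed package is accepted when its major component equals the
    requested major component and it is not older than the requested version.
    """
    inst = parse_version(installed)
    req = parse_version(requested)
    # Omitted components count as zero, so '1.2' and '1.2.0' compare equal.
    width = max(len(inst), len(req))
    inst += (0,) * (width - len(inst))
    req += (0,) * (width - len(req))
    return inst[0] == req[0] and inst >= req


# With the 0.0.<build number> scheme, any newer build satisfies an older request.
print(same_major_compatible("0.0.1234", "0.0.1000"))  # True
print(same_major_compatible("0.0.1000", "0.0.1234"))  # False: installed is older
print(same_major_compatible("1.0.0", "0.0.1"))        # False: major differs
```

Because every version produced by `0.0.${GGML_BUILD_NUMBER}` has major component 0, the build number effectively acts as a minimum-version check for consumers.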