### Local Development Setup and Build

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Commands to initialize git submodules, set up a virtual environment, install dependencies, and build the project locally.

```bash
git submodule update --init --recursive
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
make deps
make build
```

--------------------------------

### Install llama-cpp-python with server support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Install the library with the `server` extra to enable the web server functionality. The quotes keep shells such as zsh from interpreting the square brackets.

```bash
pip install 'llama-cpp-python[server]'
```

--------------------------------

### Install with RPC

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with RPC support. Ensure the oneAPI environment variables are sourced and set CMAKE_ARGS.

```bash
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
```

--------------------------------

### Install with Vulkan

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with Vulkan support. Pass `-DGGML_VULKAN=on` through the `CMAKE_ARGS` environment variable at install time.

```bash
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
```

--------------------------------

### Install with Pre-built Wheel (CPU Support)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for llama-cpp-python that includes basic CPU support.

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

--------------------------------

### Python Instructor Library Installation

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Install the instructor library using pip. This library simplifies function calling with AI models.
```bash
pip install instructor
```

--------------------------------

### Install and Run OpenAI Compatible Web Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the server package and run the web server using `llama_cpp.server`. Specify the model path and optionally GPU layers.

```bash
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
```

--------------------------------

### Install with SYCL

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with SYCL support. Ensure the oneAPI environment variables are sourced and set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER.

```bash
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```

--------------------------------

### Install with OpenBLAS (CPU)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Use this command to install with OpenBLAS support for CPU acceleration. Ensure GGML_BLAS and GGML_BLAS_VENDOR are set.

```bash
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

--------------------------------

### Install and Run Web Server with GPU Support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the server package with CUDA support and run the web server, specifying the model path and number of GPU layers.

```bash
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

--------------------------------

### Verify and Install Xcode Command Line Tools

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Checks the current Xcode installation path and installs the command line tools if they are missing. This is a prerequisite for compiling the C++ components of the library.
```bash
xcode-select -p
xcode-select --install
```

--------------------------------

### Run server with chat template configuration

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server and specify a chat format and template arguments for custom chat interactions.

```bash
python3 -m llama_cpp.server \
  --model <model_path> \
  --chat_format chatml \
  --chat_template_kwargs '{"enable_thinking": true}'
```

--------------------------------

### Run the llama-cpp-python server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the web server by specifying the path to your model. All server options are also available as environment variables.

```bash
python3 -m llama_cpp.server --model <model_path>
```

--------------------------------

### Run Llama.cpp Server with Configuration File

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the llama-cpp-python server by providing a path to a JSON configuration file using the --config_file argument.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```

--------------------------------

### Install with HIP (ROCm)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with HIP / ROCm support for AMD cards. Pass `-DGGML_HIP=on` through the `CMAKE_ARGS` environment variable at install time.

```bash
CMAKE_ARGS="-DGGML_HIP=on" pip install llama-cpp-python
```

--------------------------------

### Install llama-cpp-python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the llama-cpp-python package using pip. This command also builds llama.cpp from source.

```bash
pip install llama-cpp-python
```

--------------------------------

### Clone Repository and Install in Editable Mode

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Steps to clone the llama-cpp-python repository and install it in editable mode for development. Includes upgrading pip and installing optional dependencies.
```bash
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# install development tooling (tests, docs, ruff)
pip install -e '.[dev]'

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean
```

--------------------------------

### Install Dependencies for Ray Serve

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/ray/README.md

Installs the necessary Python packages required to run the LLM inference project using Ray.

```bash
pip install -r requirements.txt
```

--------------------------------

### Run server for code completion with increased context

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server with a larger context size, necessary for handling GitHub Copilot requests.

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```

--------------------------------

### Example Usage of Hermes Prompt Generation

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/OpenHermesFunctionCalling.ipynb

Demonstrates how to use the `generate_hermes_prompt` function with a list of sample prompts and associated functions. It iterates through the prompts, generates a formatted Hermes prompt for each, and prints the result.
```python
prompts = [
    "What's the weather in 10001?",
    "Determine the monthly mortgage payment for a loan amount of $200,000, an interest rate of 4%, and a loan term of 30 years.",
    "What's the current exchange rate for USD to EUR?",
]
functions = [get_weather, calculate_mortgage_payment, get_article_details]

for prompt in prompts:
    print(generate_hermes_prompt(prompt, functions))
```

--------------------------------

### Windows Installation with w64devkit

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Troubleshoot Windows installation errors by setting CMAKE_GENERATOR and CMAKE_ARGS to include w64devkit paths for GCC compilers.

```powershell
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```

--------------------------------

### Install Pre-built CUDA Wheel

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for CUDA support. Replace `<cuda-version>` with the appropriate CUDA version identifier (e.g., cu118, cu121).

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
```

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

--------------------------------

### Install Miniforge and Create Conda Environment

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Downloads and installs the Miniforge distribution for macOS ARM64 and creates a dedicated Python 3.9.16 environment for the project.
```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
conda create -n llama python=3.9.16
conda activate llama
```

--------------------------------

### Install Pre-built Metal Wheel

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for Metal support on macOS. Ensure your system meets the macOS and Python version requirements.

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
```

--------------------------------

### Install with CUDA

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with CUDA support by passing `-DGGML_CUDA=on` through the `CMAKE_ARGS` environment variable. Ensure your system meets the CUDA and Python version requirements.

```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

--------------------------------

### Python Basic Instructor Usage

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Shows the imports needed to use the instructor library with Pydantic models for structured output.

```python
import instructor
from pydantic import BaseModel
```

--------------------------------

### Changelog Entry Examples

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Examples of how to format changelog entries for pull requests, including the tag, scope, description, contributor, and issue number.

```markdown
- feat(server): add support for X by @contributor in #1234
- fix(ci): repair Y wheel builds by @contributor in #1234
```

--------------------------------

### Install Python for Apple Silicon (M1 Mac)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a compatible Python version for Apple Silicon (M1) Macs to avoid performance issues.
Use Miniforge for arm64 architecture.

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

--------------------------------

### Build and Start Open-Llama-in-a-box

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docker/README.md

Automated scripts to download a 3B parameter Open LLaMA model and launch an OpenBLAS-enabled server container.

```bash
cd ./open_llama
./build.sh
./start.sh
```

--------------------------------

### Backend-Specific Build Targets

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Examples of make targets for building the project with specific native acceleration backends like OpenBLAS, CUDA, Metal, or Vulkan.

```bash
make build.openblas
make build.cuda
make build.metal
make build.vulkan
```

--------------------------------

### Run server for function calling with Hugging Face tokenizer

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server for function calling, specifying the model path and the path to the Hugging Face tokenizer.

```bash
python3 -m llama_cpp.server --model <model_path> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <hf_tokenizer_path>
```

--------------------------------

### Install llama-cpp-python with Metal Support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Uninstalls existing versions and installs the latest llama-cpp-python with the GGML_METAL flag enabled to ensure GPU acceleration is compiled.

```bash
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DGGML_METAL=on" pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
```

--------------------------------

### Deploy GGUF Model with Ray Serve

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/ray/README.md

Starts the Ray Serve application to host a GGUF model at a local API endpoint.
```bash
serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```

--------------------------------

### Run Llama.cpp Server with Multimodal Model

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Use this command to start the server with a multimodal model. Ensure you specify the paths for both the main model and the clip model, and set the correct chat format for llava-1.5.

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```

--------------------------------

### Download Llama Model from Hugging Face Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Shows how to download a Llama model in GGUF format directly from Hugging Face using the `from_pretrained` method. Requires the `huggingface-hub` package to be installed.

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lmstudio-community/Qwen3.5-0.8B-GGUF",
    filename="*Q8_0.gguf",
    verbose=False
)
```

--------------------------------

### Tokenize Prompt with Low-Level API

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates how to use the low-level ctypes binding to tokenize a prompt using the llama.cpp C API. Ensure llama_backend_init() is called once at the start.

```python
import llama_cpp
import ctypes

llama_cpp.llama_backend_init()  # Must be called once at the start of each program
model_params = llama_cpp.llama_model_default_params()
ctx_params = llama_cpp.llama_context_default_params()
prompt = b"Q: Name the planets in the solar system? A: "  # use bytes for char * params
model = llama_cpp.llama_model_load_from_file(b"./models/7b/llama-model.gguf", model_params)
ctx = llama_cpp.llama_init_from_model(model, ctx_params)
vocab = llama_cpp.llama_model_get_vocab(model)
max_tokens = ctx_params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(vocab, prompt, len(prompt), tokens, max_tokens, True, False)
# release the context first, then the model
llama_cpp.llama_free(ctx)
llama_cpp.llama_model_free(model)
```

--------------------------------

### Install llama-cpp-python for M Series Mac

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Troubleshoot architecture compatibility errors on M Series Macs by specifying arm64 architecture and enabling Metal support.

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```

--------------------------------

### Langchain OpenAI LLM Interaction

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Clients.ipynb

This example uses Langchain's OpenAI LLM wrapper to generate text. It sets the OPENAI_API_KEY and OPENAI_API_BASE environment variables, instantiates the `OpenAI` class, and calls it directly with a prompt and stop sequences. The output is a string containing the generated text.
```python
import os

os.environ["OPENAI_API_KEY"] = (
    "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # can be anything
)
os.environ["OPENAI_API_BASE"] = "http://100.64.159.73:8000/v1"

from langchain.llms import OpenAI

llms = OpenAI()
llms(
    prompt="The quick brown fox jumps",
    stop=[".", "\n"],
)
```

--------------------------------

### Configure Build with Pip CLI

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Configure the llama.cpp build with specific CMake arguments using the pip install command with the -C flag.

```bash
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
```

--------------------------------

### Moondream2 Multi-modal Chat Completion from Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of loading a Moondream2 model from Hugging Face Hub using from_pretrained and MoondreamChatHandler. Adjust filenames and ensure n_ctx is sufficient for image data.
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
            ]
        }
    ]
)
# chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])
```

--------------------------------

### Display server help and options

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

View all available command-line options for configuring the server by running the help command.

```bash
python3 -m llama_cpp.server --help
```

--------------------------------

### Install with Metal (MPS)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with Metal (MPS) support for macOS. Pass `-DGGML_METAL=on` through the `CMAKE_ARGS` environment variable at install time.

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

--------------------------------

### CMake: Install Target Function for llama-cpp-python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CMakeLists.txt

Defines a reusable CMake function to install targets (libraries) for the llama-cpp-python project. It handles installation to both the source directory and the platform-specific library directory, setting RPATH properties for dynamic linking.
```cmake
function(llama_cpp_python_install_target target)
    if(NOT TARGET ${target})
        return()
    endif()

    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
    )
    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
    )
    set_target_properties(${target} PROPERTIES
        INSTALL_RPATH "$ORIGIN"
        BUILD_WITH_INSTALL_RPATH TRUE
    )
    if(UNIX)
        if(APPLE)
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "@loader_path"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        else()
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "$ORIGIN"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        endif()
    endif()
endfunction()
```

--------------------------------

### Typical Development Workflow

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Common Makefile targets for building and testing the project during development.

```bash
make build
make test
```

--------------------------------

### Run Web Server with ChatML Prompt Format

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server and specify the prompt format, such as `chatml`, to ensure correct prompt formatting for the model.
```bash
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
```

--------------------------------

### Install Dependencies for Llama_cpp Python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/readme/low_level_api_llama_cpp.md

Installs llama-cpp-python, the only third-party package the low-level examples need. The `ctypes`, `os`, and `multiprocessing` modules they also import ship with the Python standard library and must not be passed to pip.

```bash
python -m pip install llama-cpp-python
```

--------------------------------

### Use Guidance Library for Programmatic Text Generation in Python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Guidance.ipynb

Demonstrates how to use the guidance library to define and execute a program that adapts a proverb. It shows setting up the language model, defining a guidance program with placeholders and generation commands, and executing it with specific inputs.

```python
import guidance

# set the default language model used to execute guidance programs
guidance.llm = guidance.llms.OpenAI("text-davinci-003", caching=False)

# define a guidance program that adapts a proverb
program = guidance(
    """Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\n-"}}
- GPT {{gen 'chapter'}}:{{gen 'verse'}}"""
)

# execute the program on a specific proverb
executed_program = program(
    proverb="Where there is no guidance, a people falls,\nbut in an abundance of counselors there is safety.",
    book="Proverbs",
    chapter=11,
    verse=14,
)
```

--------------------------------

### Run Web Server with Custom Host and Port

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server, binding to a specific host (e.g., `0.0.0.0` for remote connections) and port.
```bash
python3 -m llama_cpp.server --host 0.0.0.0 --port 8000
```

--------------------------------

### Run llama-cpp-python API Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Configures the environment variable for the model path and launches the API server with GPU layer offloading enabled.

```bash
export MODEL=[path to your llama.cpp ggml models]/[ggml-model-name]Q4_0.gguf
python3 -m llama_cpp.server --model $MODEL --n_gpu_layers 1
```

--------------------------------

### Run Web Server using Docker

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the llama-cpp-python web server using a Docker container, mapping ports and mounting a volume for models.

```bash
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```

--------------------------------

### Install Runtime DLLs for ggml Library (CMake)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CMakeLists.txt

This CMake code snippet installs the runtime DLLs for the 'ggml' target. It copies them into both the source-tree and the platform-specific `llama_cpp/lib` directories so they are available at load time.

```cmake
install(
    FILES $<TARGET_RUNTIME_DLLS:ggml>
    DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
)
install(
    FILES $<TARGET_RUNTIME_DLLS:ggml>
    DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
)
```

--------------------------------

### Llama Class Initialization and Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates how to initialize the Llama class with a model path and perform basic text completion. It also shows the expected output format for text completions.

```APIDOC
## Llama Class Initialization and Text Completion

### Description
Initializes the `Llama` class with a specified model path and performs text completion. The example shows how to set parameters like `max_tokens`, `stop` sequences, and `echo`.

### Method
`Llama(model_path: str, ...)` for initialization, `llm(prompt: str, ...)` for text completion.

### Parameters (Initialization)
- **model_path** (str) - Required - Path to the GGUF model file.
- **n_gpu_layers** (int) - Optional - Number of layers to offload to the GPU.
- **seed** (int) - Optional - Seed for reproducibility.
- **n_ctx** (int) - Optional - Context window size.

### Parameters (Text Completion)
- **prompt** (str) - Required - The input prompt for text generation.
- **max_tokens** (int) - Optional - Maximum number of tokens to generate.
- **stop** (list[str]) - Optional - Sequences that will cause the generation to stop.
- **echo** (bool) - Optional - Whether to echo the prompt in the output.

### Request Example (Initialization and Call)
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # seed=1337, # Uncomment to set a specific seed
    # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
    "Q: Name the planets in the solar system? A: ", # Prompt
    max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
)
print(output)
```

### Response Example (Text Completion)
```json
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
```

--------------------------------

### Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Shows an example of a text completion request and its response.

```APIDOC
## Text Completion API

### Description
This API endpoint is used to generate text completions based on a given prompt. It utilizes a pre-loaded Llama model to produce coherent and contextually relevant text.

### Method
POST

### Endpoint
`/v1/completions` (Hypothetical endpoint for demonstration)

### Parameters
#### Request Body
- **model** (string) - Required - The name or path of the model to use for completion.
- **prompt** (string) - Required - The input text prompt to generate a completion for.
- **max_tokens** (integer) - Optional - The maximum number of tokens to generate.
- **temperature** (number) - Optional - Controls randomness. Lower values make output more deterministic.

### Request Example
```json
{
  "model": "../models/ggml-model.bin",
  "prompt": "### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
  "max_tokens": 10
}
```

### Response
#### Success Response (200)
- **id** (string) - Unique identifier for the completion.
- **object** (string) - Type of object returned (e.g., "text_completion").
- **created** (integer) - Timestamp of creation.
- **model** (string) - The model used for completion.
- **choices** (array) - An array of completion choices.
  - **text** (string) - The generated text completion.
  - **index** (integer) - Index of the choice.
  - **logprobs** (null) - Placeholder for log probabilities (currently null).
  - **finish_reason** (string) - Reason for finishing the generation (e.g., "length").
- **usage** (object) - Information about token usage.
  - **prompt_tokens** (integer) - Number of tokens in the prompt.
  - **completion_tokens** (integer) - Number of tokens in the completion.
  - **total_tokens** (integer) - Total tokens used.

#### Response Example
```json
{
  "id": "cmpl-e623667d-d6cc-4908-a648-60380f723592",
  "object": "text_completion",
  "created": 1680227881,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " capital of France is Paris.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 6,
    "total_tokens": 85
  }
}
```
```

--------------------------------

### Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Shows an example of generating text completions using a loaded Llama model.

```APIDOC
## Text Completion API

### Description
This endpoint allows users to generate text completions based on a given prompt. It utilizes a pre-loaded Llama model to produce relevant and coherent text.

### Method
POST

### Endpoint
`/v1/completions`

### Parameters
#### Query Parameters
None

#### Request Body
- **model** (string) - Required - The ID or path of the model to use for completion.
- **prompt** (string) - Required - The input text prompt for the model.
- **max_tokens** (integer) - Optional - The maximum number of tokens to generate.
- **temperature** (number) - Optional - Controls randomness. Lower values make output more deterministic.
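
The request body described above could be assembled along these lines. This is a minimal sketch using only the Python standard library: it builds the POST request but does not send it. The helper name `build_completion_request` and the base URL `http://localhost:8000` are assumptions for illustration, not part of llama-cpp-python.

```python
import json
import urllib.request


def build_completion_request(prompt, model, max_tokens=None, temperature=None,
                             base_url="http://localhost:8000"):
    """Build a POST request for /v1/completions (hypothetical helper)."""
    # Required fields from the request body list above.
    payload = {"model": model, "prompt": prompt}
    # Optional fields are included only when the caller sets them.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if temperature is not None:
        payload["temperature"] = temperature
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    prompt="What is the capital of France?",
    model="../models/ggml-model.bin",
    max_tokens=10,
)
print(request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running server: urllib.request.urlopen(request)
```

Leaving unset optional fields out of the payload lets the server apply its own defaults instead of receiving explicit nulls.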

### Request Example
```json
{
  "model": "../models/ggml-model.bin",
  "prompt": "### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
  "max_tokens": 10
}
```

### Response
#### Success Response (200)
- **id** (string) - Unique identifier for the completion.
- **object** (string) - Type of object returned (e.g., "text_completion").
- **created** (integer) - Timestamp of creation.
- **model** (string) - The model used for completion.
- **choices** (array) - An array of completion choices.
  - **text** (string) - The generated text completion.
  - **index** (integer) - The index of the choice.
  - **logprobs** (null) - Log probabilities (currently null).
  - **finish_reason** (string) - The reason the generation finished (e.g., "length").
- **usage** (object) - Usage statistics.
  - **prompt_tokens** (integer) - Number of tokens in the prompt.
  - **completion_tokens** (integer) - Number of tokens generated.
  - **total_tokens** (integer) - Total tokens used.
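
The response fields listed above can be read from a parsed response roughly as follows. This is a sketch only: the helper `summarize_completion` is hypothetical, and the sample payload is illustrative data shaped like the field list rather than output from a live server.

```python
import json

# Illustrative payload shaped like the response fields listed above.
raw_response = """
{
  "id": "cmpl-example",
  "object": "text_completion",
  "created": 1680227881,
  "model": "../models/ggml-model.bin",
  "choices": [
    {"text": " capital of France is Paris.", "index": 0,
     "logprobs": null, "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 79, "completion_tokens": 6, "total_tokens": 85}
}
"""


def summarize_completion(response):
    """Return the first choice's text, its finish reason, and the total token count."""
    first_choice = response["choices"][0]
    return {
        "text": first_choice["text"],
        "finish_reason": first_choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


summary = summarize_completion(json.loads(raw_response))
print(summary)
```

Checking `finish_reason` is a cheap way to tell a natural stop (`"stop"`) from a completion that was cut off by `max_tokens` (`"length"`).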

#### Response Example
```json
{
  "id": "cmpl-f8d90e63-4939-491c-9775-fc15aa55505e",
  "object": "text_completion",
  "created": 1680228062,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```
```

--------------------------------

### Function Calling with OpenAI Python Client

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Demonstrates how to set up the OpenAI client to interact with a running llama-cpp-python server for function calling.

```APIDOC
## Function Calling with OpenAI Python Client

### Description
This section provides a basic demonstration of setting up the OpenAI Python Client to communicate with a `llama-cpp-python` server that supports function calling. It includes the necessary imports and client initialization.
### Method
N/A (Client Setup)

### Endpoint
N/A (Client Setup)

### Parameters
N/A

### Request Example
```python
import openai
import json

client = openai.OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url="http://100.64.159.73:8000/v1",  # NOTE: Replace with IP address and port of your llama-cpp-python server
)

# Example dummy function hard coded to return the same weather
```

### Response
N/A (Client Setup)

### Response Example
N/A (Client Setup)
```

--------------------------------

### Load Model from Hugging Face Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server and load a model from the Hugging Face Hub using the `--hf_model_repo_id` flag and a model file pattern.

```bash
python3 -m llama_cpp.server --hf_model_repo_id lmstudio-community/Qwen3.5-0.8B-GGUF --model '*Q8_0.gguf'
```

--------------------------------

### Low-Level API - LLAMA constants

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/api-reference.md

Exposes low-level Python bindings for llama.cpp using ctypes, filtering constants starting with `LLAMA_`.

```APIDOC
## Low-Level API - LLAMA constants

This section exposes low-level Python bindings for llama.cpp using Python's ctypes library. It includes constants that start with `LLAMA_`.
```

--------------------------------

### Low-Level API - llama_cpp functions

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/api-reference.md

Exposes low-level Python bindings for llama.cpp using ctypes, filtering functions starting with `llama_`.

```APIDOC
## Low-Level API - llama_cpp functions

This section exposes low-level Python bindings for llama.cpp using Python's ctypes library. It includes functions that start with `llama_`.
```

--------------------------------

### Initialize Llama Backend

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Batching.ipynb

Initialize the Llama backend.
Set `numa` to False if NUMA is not required or should be disabled.

```python
import llama_cpp

llama_cpp.llama_backend_init(numa=False)
```

--------------------------------

### Basic Text Completion with Llama Class

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates how to perform basic text completion using the Llama class. Configure model path, GPU layers, seed, and context window. Control generation with max tokens, stop sequences, and echo prompt.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # seed=1337, # Uncomment to set a specific seed
    # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
    "Q: Name the planets in the solar system? A: ", # Prompt
    max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
```

--------------------------------

### Model Loading and Initialization

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Demonstrates the process of loading a Llama model using the llama_model_load function and initializing the context.

```APIDOC
## Model Loading

### Description
Loads a Llama model from a specified file path and initializes the necessary structures for inference.

### Method
Internal Library Function

### Endpoint
N/A (Local Library Function)

### Parameters
#### Path Parameters
- **model_path** (string) - Required - The file path to the GGML model file.

### Request Example
```
llama_model_load(model_path='../models/ggml-model.bin')
```

### Response
#### Success Response (200)
- **model_info** (object) - Contains details about the loaded model, such as vocabulary size, context length, embedding dimensions, etc.
- **context_info** (object) - Contains details about the initialized context, such as KV cache size.

#### Response Example
```json
{
  "model_info": {
    "n_vocab": 32000,
    "n_ctx": 512,
    "n_embd": 4096,
    "n_mult": 256,
    "n_head": 32,
    "n_layer": 32,
    "n_rot": 128,
    "f16": 2,
    "n_ff": 11008,
    "n_parts": 1,
    "type": 1
  },
  "context_info": {
    "ggml_map_size": "4017.70 MB",
    "ggml_ctx_size": "81.25 KB",
    "mem_required": "5809.78 MB (+ 1026.00 MB per state)",
    "kv_self_size": "256.00 MB"
  }
}
```
```

--------------------------------

### Llava1.5 Multi-modal Chat Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of using `Llava15ChatHandler` for multi-modal chat completions. Ensure `clip_model_path` is set correctly and `n_ctx` is increased to make room for the image embeddings.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
            ]
        }
    ]
)
```

--------------------------------

### Initialize Sampler Chain

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Batching.ipynb

Adds several samplers (top-k, top-p, temperature, distribution) to an existing sampler chain to control text generation. Ensure the `llama_cpp` library is imported and `sampler_chain` has already been created.

```python
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_k(40))
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_p(0.9, 1))
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_temp(0.4))
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_dist(1234))  # Final "dist" sampler
```

--------------------------------

### Speculative Decoding with Draft Model

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of initializing Llama with a draft model for speculative decoding. The `num_pred_tokens` parameter can be tuned for performance based on the hardware (GPU vs CPU).

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict; 10 is the default and
    # generally good for GPU, while 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
```

--------------------------------

### API Completion Response JSON

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Example of a successful text completion response from the llama-cpp-python API. It includes the completion ID, model path, generated text, and token usage statistics.

```json
{
  "id": "cmpl-bc5dc1ba-f7ce-441c-a558-5005f2fb89b9",
  "object": "text_completion",
  "created": 1680227366,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```

--------------------------------

### Initialize OpenAI Client for llama-cpp-python Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

This code snippet demonstrates how to initialize the OpenAI Python client to connect to a running llama-cpp-python server. It requires the server's IP address and port, and the API key can be any string.

```python
import openai
import json

client = openai.OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url="http://100.64.159.73:8000/v1",  # NOTE: Replace with IP address and port of your llama-cpp-python server
)

# Example dummy function hard coded to return the same weather
```

--------------------------------

### API Completion Response JSON

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Example of the JSON object returned by the API after a text completion request. It includes the completion ID, model path, generated text choices, and token usage metrics.

```json
{
  "id": "cmpl-2623073e-004f-4386-98e0-7e6ea617523a",
  "object": "text_completion",
  "created": 1680227558,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```
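
A response like the one above can be inspected with nothing more than the standard library. The following is a minimal sketch, not part of the notebook: `summarize_completion` is a hypothetical helper introduced here for illustration, the embedded payload is a trimmed copy of the response (the long `text` field is shortened), and reading `finish_reason == "length"` as "generation stopped at the token limit" is an assumption based on the OpenAI-style convention.

```python
import json

# Trimmed copy of the completion response shown above; the "text" field is
# shortened for readability (assumption: the structure is otherwise unchanged).
raw = """
{
  "id": "cmpl-2623073e-004f-4386-98e0-7e6ea617523a",
  "object": "text_completion",
  "created": 1680227558,
  "model": "../models/ggml-model.bin",
  "choices": [
    {"text": " ### Instructions: ...", "index": 0, "logprobs": null, "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 79, "completion_tokens": 1, "total_tokens": 80}
}
"""


def summarize_completion(payload: dict) -> dict:
    """Hypothetical helper: pull the commonly used fields out of a text_completion payload."""
    choice = payload["choices"][0]
    usage = payload["usage"]
    return {
        "text": choice["text"],
        # Assumed meaning: "length" indicates the token limit ended generation.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        # Sanity check that the reported total matches the two parts.
        "usage_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }


summary = summarize_completion(json.loads(raw))
print(summary)
```

In this sample only one completion token was generated against a 79-token prompt, so a check like `truncated` is a quick way to spot runs where the prompt consumed most of the budget.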