### Local Development Setup and Build

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Commands to initialize git submodules, set up a virtual environment, install dependencies, and build the project locally.

```bash
git submodule update --init --recursive
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
make deps
make build
```

--------------------------------

### Install llama-cpp-python with server support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Install the library with the server extra to enable the web server functionality.

```bash
pip install 'llama-cpp-python[server]'
```

--------------------------------

### Install with RPC

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with RPC support by passing `-DGGML_RPC=on` through the `CMAKE_ARGS` environment variable. The README snippet also sources the oneAPI environment script first.

```bash
source /opt/intel/oneapi/setvars.sh   
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
```

--------------------------------

### Install with Vulkan

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with Vulkan support. `GGML_VULKAN` is a CMake option, not an environment variable of its own: pass `-DGGML_VULKAN=on` through the `CMAKE_ARGS` environment variable when running pip.

```bash
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
```

--------------------------------

### Install with Pre-built Wheel (CPU Support)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for llama-cpp-python that includes basic CPU support.

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

--------------------------------

### Python Instructor Library Installation

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Install the `instructor` library with pip. The notebook uses it to simplify function calling with AI models.

```bash
pip install instructor
```

--------------------------------

### Install and Run OpenAI Compatible Web Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the server package and run the web server using `llama_cpp.server`. Specify the model path and optionally GPU layers.

```bash
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
```

--------------------------------

### Install with SYCL

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with SYCL support. Source the oneAPI environment script first, then pass `-DGGML_SYCL=on` together with the `icx` and `icpx` compilers (`CMAKE_C_COMPILER`, `CMAKE_CXX_COMPILER`) through `CMAKE_ARGS`.

```bash
source /opt/intel/oneapi/setvars.sh   
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```

--------------------------------

### Install with OpenBLAS (CPU)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Use this command to install with OpenBLAS support for CPU acceleration. The `GGML_BLAS` and `GGML_BLAS_VENDOR` CMake options are passed through the `CMAKE_ARGS` environment variable.

```bash
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

--------------------------------

### Install and Run Web Server with GPU Support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the server package with CUDA support and run the web server, specifying the model path and number of GPU layers.

```bash
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

--------------------------------

### Verify and Install Xcode Command Line Tools

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Checks the current Xcode installation path and installs the command line tools if they are missing. This is a prerequisite for compiling the C++ components of the library.

```bash
xcode-select -p
xcode-select --install
```

--------------------------------

### Run server with chat template configuration

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server and specify a chat format and template arguments for custom chat interactions.

```bash
python3 -m llama_cpp.server \
  --model <model_path> \
  --chat_format chatml \
  --chat_template_kwargs '{"enable_thinking": true}'
```
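The value passed to `--chat_template_kwargs` is a JSON object, so quoting is easy to get wrong when the command is assembled from a script. The following is a minimal sketch of building the argument list in Python; `build_server_args` is a hypothetical helper, and only the flags shown in the command above are used.

```python
import json
import shlex
import sys


def build_server_args(model_path, chat_format=None, chat_template_kwargs=None):
    """Return an argv list for `python -m llama_cpp.server` (hypothetical helper)."""
    args = [sys.executable, "-m", "llama_cpp.server", "--model", model_path]
    if chat_format is not None:
        args += ["--chat_format", chat_format]
    if chat_template_kwargs is not None:
        # json.dumps emits valid JSON (double quotes, lowercase true/false).
        args += ["--chat_template_kwargs", json.dumps(chat_template_kwargs)]
    return args


args = build_server_args(
    "models/example.gguf",  # placeholder path, not a real model
    chat_format="chatml",
    chat_template_kwargs={"enable_thinking": True},
)

# shlex.join prints the shell-quoted equivalent of the argument list.
# Passing `args` to subprocess.run would launch the server.
print(shlex.join(args))
```

Passing a list to `subprocess.run` avoids shell quoting entirely, which is why the JSON string is kept as a single list element.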

--------------------------------

### Run the llama-cpp-python server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the web server by specifying the path to your model. All server options are available as environment variables.

```bash
python3 -m llama_cpp.server --model <model_path>
```

--------------------------------

### Run Llama.cpp Server with Configuration File

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the llama-cpp-python server by providing a path to a JSON configuration file using the --config_file argument.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```
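As a rough illustration, a configuration file can be generated from Python and then passed to `--config_file`. The layout below (top-level `host` and `port` plus a `models` list whose keys mirror the server's command-line options) is an assumption and should be checked against the server documentation; the model path and alias are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: server-wide settings plus one entry per model.
config = {
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/example.gguf",  # placeholder path
            "model_alias": "example-model",  # hypothetical alias
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "n_ctx": 2048,
        }
    ],
}

# Write the file to a temporary directory so the sketch has no side effects.
config_path = Path(tempfile.mkdtemp()) / "server_config.json"
config_path.write_text(json.dumps(config, indent=2))

print(f"python3 -m llama_cpp.server --config_file {config_path}")
```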

--------------------------------

### Install with HIP (ROCm)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with HIP / ROCm support for AMD cards. `GGML_HIP` is a CMake option: pass `-DGGML_HIP=on` through the `CMAKE_ARGS` environment variable when running pip.

```bash
CMAKE_ARGS="-DGGML_HIP=on" pip install llama-cpp-python
```

--------------------------------

### Install llama-cpp-python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install the llama-cpp-python package using pip. This command also builds llama.cpp from source.

```bash
pip install llama-cpp-python
```

--------------------------------

### Clone Repository and Install in Editable Mode

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Steps to clone the llama-cpp-python repository and install it in editable mode for development. Includes upgrading pip and installing optional dependencies.

```bash
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# install development tooling (tests, docs, ruff)
pip install -e '.[dev]'

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean
```

--------------------------------

### Install Dependencies for Ray Serve

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/ray/README.md

Installs the necessary Python packages required to run the LLM inference project using Ray.

```bash
pip install -r requirements.txt
```

--------------------------------

### Run server for code completion with increased context

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server with a larger context size, necessary for handling GitHub Copilot requests.

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```

--------------------------------

### Example Usage of Hermes Prompt Generation

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/OpenHermesFunctionCalling.ipynb

Demonstrates how to use the `generate_hermes_prompt` function with a list of sample prompts and associated functions. It iterates through the prompts, generates a formatted Hermes prompt for each, and prints the result. The snippet relies on `generate_hermes_prompt` and the three listed functions being defined beforehand in the notebook.

```python
prompts = [
    "What's the weather in 10001?",
    "Determine the monthly mortgage payment for a loan amount of $200,000, an interest rate of 4%, and a loan term of 30 years.",
    "What's the current exchange rate for USD to EUR?",
]
functions = [get_weather, calculate_mortgage_payment, get_article_details]

for prompt in prompts:
    print(generate_hermes_prompt(prompt, functions))
```

--------------------------------

### Windows Installation with w64devkit

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Troubleshoot Windows installation errors by setting CMAKE_GENERATOR and CMAKE_ARGS to include w64devkit paths for GCC compilers.

```powershell
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```

--------------------------------

### Install Pre-built CUDA Wheel

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for CUDA support. Replace <cuda-version> with the appropriate CUDA version identifier (e.g., cu118, cu121).

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
```

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

--------------------------------

### Install Miniforge and Create Conda Environment

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Downloads and installs the Miniforge distribution for macOS ARM64, then creates and activates a dedicated Python 3.9.16 environment named `llama` for the project.

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
conda create -n llama python=3.9.16
conda activate llama
```

--------------------------------

### Install Pre-built Metal Wheel

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a pre-built wheel for Metal support on macOS. Ensure your system meets the macOS and Python version requirements.

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
```

--------------------------------

### Install with CUDA

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with CUDA support by passing the `GGML_CUDA` CMake option (`-DGGML_CUDA=on`) through the `CMAKE_ARGS` environment variable. Ensure your system meets the CUDA and Python version requirements.

```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

--------------------------------

### Python Basic Instructor Usage

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Shows the imports needed to use the instructor library with Pydantic models: `instructor` itself and Pydantic's `BaseModel`, the base class for structured-output models.

```python
import instructor
from pydantic import BaseModel
```

--------------------------------

### Changelog Entry Examples

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Examples of how to format changelog entries for pull requests, including the tag, scope, description, contributor, and issue number.

```markdown
- feat(server): add support for X by @contributor in #1234
- fix(ci): repair Y wheel builds by @contributor in #1234
```
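One way to sanity-check an entry before opening a pull request is a small pattern match. This is only a sketch: the regular expression is inferred from the two examples above and is not an official rule of the project, and `parse_entry` is a hypothetical helper.

```python
import re

# Assumed shape: "- type(scope): description by @user in #number".
ENTRY_PATTERN = re.compile(
    r"^- (?P<type>[a-z]+)\((?P<scope>[^)]+)\): (?P<description>.+) "
    r"by @(?P<contributor>[\w-]+) in #(?P<number>\d+)$"
)


def parse_entry(line):
    """Return the named parts of a changelog entry, or None if it does not match."""
    match = ENTRY_PATTERN.match(line)
    return match.groupdict() if match else None


entry = parse_entry("- feat(server): add support for X by @contributor in #1234")
print(entry)
```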

--------------------------------

### Install Python for Apple Silicon (M1 Mac)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install a compatible Python version for Apple Silicon (M1) Macs to avoid performance issues. Use Miniforge for arm64 architecture.

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

--------------------------------

### Build and Start Open-Llama-in-a-box

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docker/README.md

Automated scripts to download a 3B parameter Open LLaMA model and launch an OpenBLAS-enabled server container.

```bash
cd ./open_llama
./build.sh
./start.sh
```

--------------------------------

### Backend-Specific Build Targets

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CONTRIBUTING.md

Examples of make targets for building the project with specific native acceleration backends like OpenBLAS, CUDA, Metal, or Vulkan.

```bash
make build.openblas
make build.cuda
make build.metal
make build.vulkan
```

--------------------------------

### Run server for function calling with Hugging Face tokenizer

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Start the server for function calling, specifying the model path and the path to the Hugging Face tokenizer.

```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```

--------------------------------

### Install llama-cpp-python with Metal Support

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Uninstalls existing versions and installs the latest llama-cpp-python with the GGML_METAL flag enabled to ensure GPU acceleration is compiled.

```bash
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DGGML_METAL=on" pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
```

--------------------------------

### Deploy GGUF Model with Ray Serve

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/ray/README.md

Starts the Ray Serve application to host a GGUF model at a local API endpoint.

```bash
serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```

--------------------------------

### Run Llama.cpp Server with Multimodal Model

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

Use this command to start the server with a multimodal model. Ensure you specify the paths for both the main model and the clip model, and set the correct chat format for llava-1.5.

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```

--------------------------------

### Download Llama Model from Hugging Face Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Shows how to download a Llama model in GGUF format directly from Hugging Face using the `from_pretrained` method. Requires the `huggingface-hub` package to be installed.

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lmstudio-community/Qwen3.5-0.8B-GGUF",
    filename="*Q8_0.gguf",
    verbose=False
)
```
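The `filename` argument above is a wildcard pattern rather than an exact name. Assuming it is matched against the repository's file list with shell-style globbing (an assumption about the library's behavior), the selection can be previewed with the standard `fnmatch` module. The file names below are made up for illustration.

```python
import fnmatch

# Made-up repository listing; a real one would come from the Hugging Face Hub.
repo_files = [
    "README.md",
    "example-model-Q4_K_M.gguf",
    "example-model-Q8_0.gguf",
]

pattern = "*Q8_0.gguf"
matches = fnmatch.filter(repo_files, pattern)

# A pattern that matches zero or several files is ambiguous, so check the count.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match for {pattern!r}, got {matches}")

print(matches[0])
```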

--------------------------------

### Tokenize Prompt with Low-Level API

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates how to use the low-level ctypes binding to tokenize a prompt using the llama.cpp C API. Ensure llama_backend_init() is called once at the start.

```python
import llama_cpp
import ctypes
llama_cpp.llama_backend_init()  # Must be called once at the start of each program
model_params = llama_cpp.llama_model_default_params()
ctx_params = llama_cpp.llama_context_default_params()
prompt = b"Q: Name the planets in the solar system? A: "
# use bytes for char * params
model = llama_cpp.llama_model_load_from_file(b"./models/7b/llama-model.gguf", model_params)
ctx = llama_cpp.llama_init_from_model(model, ctx_params)
vocab = llama_cpp.llama_model_get_vocab(model)
max_tokens = ctx_params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(vocab, prompt, len(prompt), tokens, max_tokens, True, False)
# release the context and the model once finished
llama_cpp.llama_free(ctx)
llama_cpp.llama_model_free(model)
```
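The `(llama_cpp.llama_token * int(max_tokens))()` expression above is the standard ctypes idiom for allocating a fixed-size C array. The sketch below shows the same idiom with only the standard library. It assumes the token type is a 32-bit signed integer and uses made-up token ids in place of a real `llama_tokenize` call.

```python
import ctypes

# Assumption: llama_cpp.llama_token is a 32-bit signed integer;
# ctypes.c_int32 stands in so the sketch runs without the library.
token_type = ctypes.c_int32
max_tokens = 8

# `(element_type * length)()` allocates a zero-initialised C array.
tokens = (token_type * max_tokens)()

# Stand-in for llama_tokenize: write made-up token ids and report the count.
fake_ids = [1, 29871, 12]
for i, token_id in enumerate(fake_ids):
    tokens[i] = token_id
n_tokens = len(fake_ids)

# Only the first n_tokens entries are meaningful; the rest remain zero.
print(list(tokens[:n_tokens]))
```

The tokenizer fills a caller-provided buffer and returns the number of entries written, so the buffer is sliced by that count rather than read in full.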

--------------------------------

### Install llama-cpp-python for M Series Mac

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Troubleshoot architecture compatibility errors on M Series Macs by specifying arm64 architecture and enabling Metal support.

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```

--------------------------------

### Langchain OpenAI LLM Interaction

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Clients.ipynb

Uses LangChain's `OpenAI` LLM wrapper to generate text against a llama-cpp-python server. Set the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables, instantiate the `OpenAI` class, and call it with a prompt and stop sequences. The call returns the generated text as a string.

```python
import os

os.environ["OPENAI_API_KEY"] = (
    "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # can be anything
)
os.environ["OPENAI_API_BASE"] = "http://100.64.159.73:8000/v1"  # replace with the address of your server

from langchain.llms import OpenAI

llms = OpenAI()
llms(
    prompt="The quick brown fox jumps",
    stop=[ ".", "\n"],
)
```

--------------------------------

### Configure Build with Pip CLI

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Configure the llama.cpp build with specific CMake arguments using the pip install command with the -C flag.

```bash
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
```
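The two ways of passing CMake options differ in their separator: the `CMAKE_ARGS` environment variable uses spaces, while the `-C cmake.args=` form uses semicolons. The sketch below renders both from one dictionary; `cmake_defines` is a hypothetical helper, and the commands are printed rather than executed.

```python
def cmake_defines(options):
    """Turn {"GGML_BLAS": "ON"} into ["-DGGML_BLAS=ON"] (hypothetical helper)."""
    return [f"-D{name}={value}" for name, value in options.items()]


options = {"GGML_BLAS": "ON", "GGML_BLAS_VENDOR": "OpenBLAS"}
defines = cmake_defines(options)

# Environment-variable form: definitions separated by spaces.
env_form = 'CMAKE_ARGS="{}" pip install llama-cpp-python'.format(" ".join(defines))

# pip -C form: definitions separated by semicolons.
cli_form = 'pip install llama-cpp-python -C cmake.args="{}"'.format(";".join(defines))

print(env_form)
print(cli_form)
```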

--------------------------------

### Moondream2 Multi-modal Chat Completion from Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of loading a Moondream2 model from Hugging Face Hub using from_pretrained and MoondreamChatHandler. Adjust filenames and ensure n_ctx is sufficient for image data.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*mmproj*",
)

llm = Llama.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*text-model*",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } } 

            ]
        }
    ]
)
# chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])
```
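For images stored locally, a common approach is to embed the bytes as a base64 data URI in place of the HTTP URL. The sketch below assumes the chat handler accepts data URIs in the `image_url` field; both helper functions are hypothetical, and the image bytes are a placeholder rather than a real file.

```python
import base64


def image_bytes_to_data_uri(data, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_image_messages(question, image_url):
    """Build a `messages` list in the same shape as the example above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


# Placeholder bytes; in practice they would come from open(path, "rb").read().
data_uri = image_bytes_to_data_uri(b"\x89PNG placeholder")
messages = build_image_messages("What's in this image?", data_uri)

print(messages[0]["content"][1]["image_url"]["url"][:30])
```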

--------------------------------

### Display server help and options

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/server.md

View all available command-line options for configuring the server by running the help command.

```bash
python3 -m llama_cpp.server --help
```

--------------------------------

### Install with Metal (MPS)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Install with Metal (MPS) support for macOS. `GGML_METAL` is a CMake option: pass `-DGGML_METAL=on` through the `CMAKE_ARGS` environment variable when running pip.

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

--------------------------------

### CMake: Install Target Function for llama-cpp-python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CMakeLists.txt

Defines a reusable CMake function to install targets (libraries) for the llama-cpp-python project. It handles installation to both the source directory and the platform-specific library directory, setting RPATH properties for dynamic linking.

```cmake
function(llama_cpp_python_install_target target)
    if(NOT TARGET ${target})
        return()
    endif()

    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
    )
    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
    )
    set_target_properties(${target} PROPERTIES
        INSTALL_RPATH "$ORIGIN"
        BUILD_WITH_INSTALL_RPATH TRUE
    )
    if(UNIX)
        if(APPLE)
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "@loader_path"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        else()
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "$ORIGIN"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        endif()
    endif()
endfunction()
```

--------------------------------

### Typical Development Workflow

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Common Makefile targets for building and testing the project during development.

```bash
make build
make test
```

--------------------------------

### Run Web Server with ChatML Prompt Format

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server and specify the prompt format, such as `chatml`, to ensure correct prompt formatting for the model.

```bash
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
```

--------------------------------

### Install Dependencies for Llama_cpp Python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/readme/low_level_api_llama_cpp.md

Installs the llama-cpp-python package. The other modules used by the low-level example (`ctypes`, `os`, and `multiprocessing`) are part of the Python standard library and cannot be installed with pip.

```bash
python -m pip install llama-cpp-python
```

--------------------------------

### Use Guidance Library for Programmatic Text Generation in Python

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Guidance.ipynb

Demonstrates how to use the guidance library to define and execute a program that adapts a proverb. It shows setting up the language model, defining a guidance program with placeholders and generation commands, and executing it with specific inputs.

```python
import guidance

# set the default language model used to execute guidance programs
guidance.llm = guidance.llms.OpenAI("text-davinci-003", caching=False)

# define a guidance program that adapts a proverb
program = guidance(
    """Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\n-"}}
- GPT {{gen 'chapter'}}:{{gen 'verse'}}"""
)

# execute the program on a specific proverb
executed_program = program(
    proverb="Where there is no guidance, a people falls,\nbut in an abundance of counselors there is safety.",
    book="Proverbs",
    chapter=11,
    verse=14,
)
```

--------------------------------

### Run Web Server with Custom Host and Port

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server, binding to a specific host (e.g., `0.0.0.0` for remote connections) and port.

```bash
python3 -m llama_cpp.server --host 0.0.0.0 --port 8000
```

--------------------------------

### Run llama-cpp-python API Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Configures the environment variable for the model path and launches the API server with GPU layer offloading enabled.

```bash
export MODEL=[path to your llama.cpp ggml models]/[ggml-model-name]Q4_0.gguf
python3 -m llama_cpp.server --model $MODEL --n_gpu_layers 1
```

--------------------------------

### Run Web Server using Docker

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the llama-cpp-python web server using a Docker container, mapping ports and mounting a volume for models.

```bash
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```

--------------------------------

### Install Runtime DLLs for ggml Library (CMake)

Source: https://github.com/abetlen/llama-cpp-python/blob/main/CMakeLists.txt

This CMake snippet installs the runtime DLLs that the `ggml` target depends on. The files are copied to two destinations: the `llama_cpp/lib` directory in the source tree and the same directory under `SKBUILD_PLATLIB_DIR`, the platform library directory of the packaged build.

```cmake
install(
    FILES $<TARGET_RUNTIME_DLLS:ggml>
    DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
)
install(
    FILES $<TARGET_RUNTIME_DLLS:ggml>
    DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
)
```

--------------------------------

### Llama Class Initialization and Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates how to initialize the Llama class with a model path and perform basic text completion. It also shows the expected output format for text completions.

```APIDOC
## Llama Class Initialization and Text Completion

### Description
Initializes the `Llama` class with a specified model path and performs text completion. The example shows how to set parameters like `max_tokens`, `stop` sequences, and `echo`.

### Method
`Llama(model_path: str, ...)` for initialization, `llm(prompt: str, ...)` for text completion.

### Parameters (Initialization)
- **model_path** (str) - Required - Path to the GGUF model file.
- **n_gpu_layers** (int) - Optional - Number of layers to offload to the GPU.
- **seed** (int) - Optional - Seed for reproducibility.
- **n_ctx** (int) - Optional - Context window size.

### Parameters (Text Completion)
- **prompt** (str) - Required - The input prompt for text generation.
- **max_tokens** (int) - Optional - Maximum number of tokens to generate.
- **stop** (list[str]) - Optional - Sequences that will cause the generation to stop.
- **echo** (bool) - Optional - Whether to echo the prompt in the output.

### Request Example (Initialization and Call)
```python
from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
)
print(output)
```

### Response Example (Text Completion)
```json
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
```
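A short sketch of reading the returned dictionary follows. The `output` value is hand-written sample data in the same shape as the response example above, not real model output, so the snippet runs without a model.

```python
# Sample data shaped like the response example above (not real model output).
output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 4, "total_tokens": 18},
}

prompt = "Q: Name the planets in the solar system? A: "
choice = output["choices"][0]
text = choice["text"]

# With echo=True the prompt is included in the text, so strip it off.
answer = text[len(prompt):] if text.startswith(prompt) else text

# "length" means generation hit max_tokens; "stop" means it ended on its own
# or at one of the stop sequences.
truncated = choice["finish_reason"] == "length"

print(answer.strip(), "| truncated:", truncated)
```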

--------------------------------

### Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Shows an example of a text completion request and its response.

```APIDOC
## Text Completion API

### Description
This API endpoint is used to generate text completions based on a given prompt. It utilizes a pre-loaded Llama model to produce coherent and contextually relevant text.

### Method
POST

### Endpoint
`/v1/completions` (Hypothetical endpoint for demonstration)

### Parameters
#### Request Body
- **model** (string) - Required - The name or path of the model to use for completion.
- **prompt** (string) - Required - The input text prompt to generate a completion for.
- **max_tokens** (integer) - Optional - The maximum number of tokens to generate.
- **temperature** (number) - Optional - Controls randomness. Lower values make output more deterministic.

### Request Example
```json
{
  "model": "../models/ggml-model.bin",
  "prompt": "### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
  "max_tokens": 10
}
```

### Response
#### Success Response (200)
- **id** (string) - Unique identifier for the completion.
- **object** (string) - Type of object returned (e.g., "text_completion").
- **created** (integer) - Timestamp of creation.
- **model** (string) - The model used for completion.
- **choices** (array) - An array of completion choices.
  - **text** (string) - The generated text completion.
  - **index** (integer) - Index of the choice.
  - **logprobs** (null) - Placeholder for log probabilities (currently null).
  - **finish_reason** (string) - Reason for finishing the generation (e.g., "length").
- **usage** (object) - Information about token usage.
  - **prompt_tokens** (integer) - Number of tokens in the prompt.
  - **completion_tokens** (integer) - Number of tokens in the completion.
  - **total_tokens** (integer) - Total tokens used.

#### Response Example
```json
{
  "id": "cmpl-e623667d-d6cc-4908-a648-60380f723592",
  "object": "text_completion",
  "created": 1680227881,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " capital of France is Paris.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 6,
    "total_tokens": 85
  }
}
```
```
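The request above can be assembled with the standard library alone. This is a minimal sketch that assumes a server listening on `localhost:8000`. The request is built but not sent, so the snippet runs without a server; `send` is a hypothetical helper for when one is available.

```python
import json
import urllib.request

# Assumption: a llama_cpp.server instance listening on localhost:8000.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "../models/ggml-model.bin",
    "prompt": "What is the capital of France?",
    "max_tokens": 10,
    "temperature": 0.2,
}

request = urllib.request.Request(
    url=f"{BASE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(req):
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Not called here so the sketch runs offline:
# completion = send(request)
# print(completion["choices"][0]["text"])
print(request.get_method(), request.full_url)
```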

--------------------------------

### Text Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Shows an example of generating text completions using a loaded Llama model.

```APIDOC
## Text Completion API

### Description
This endpoint allows users to generate text completions based on a given prompt. It utilizes a pre-loaded Llama model to produce relevant and coherent text.

### Method
POST

### Endpoint
`/v1/completions`

### Parameters
#### Query Parameters
None

#### Request Body
- **model** (string) - Required - The ID or path of the model to use for completion.
- **prompt** (string) - Required - The input text prompt for the model.
- **max_tokens** (integer) - Optional - The maximum number of tokens to generate.
- **temperature** (number) - Optional - Controls randomness. Lower values make output more deterministic.

### Request Example
```json
{
  "model": "../models/ggml-model.bin",
  "prompt": "### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
  "max_tokens": 10
}
```

### Response
#### Success Response (200)
- **id** (string) - Unique identifier for the completion.
- **object** (string) - Type of object returned (e.g., "text_completion").
- **created** (integer) - Timestamp of creation.
- **model** (string) - The model used for completion.
- **choices** (array) - An array of completion choices.
  - **text** (string) - The generated text completion.
  - **index** (integer) - The index of the choice.
  - **logprobs** (null) - Log probabilities (currently null).
  - **finish_reason** (string) - The reason the generation finished (e.g., "length").
- **usage** (object) - Usage statistics.
  - **prompt_tokens** (integer) - Number of tokens in the prompt.
  - **completion_tokens** (integer) - Number of tokens generated.
  - **total_tokens** (integer) - Total tokens used.

#### Response Example
```json
{
  "id": "cmpl-f8d90e63-4939-491c-9775-fc15aa55505e",
  "object": "text_completion",
  "created": 1680228062,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```
```
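
The request described above can be sketched with the standard library alone. This is a minimal illustration, not the project's own client: the helper names `build_completion_payload` and `build_completion_request` are hypothetical, and the server address `http://localhost:8000` is an assumption to replace with your own host and port.

```python
import json
import urllib.request


def build_completion_payload(model, prompt, max_tokens=None, temperature=None):
    """Build the JSON body for POST /v1/completions, omitting unset optional fields."""
    payload = {"model": model, "prompt": prompt}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


def build_completion_request(base_url, payload):
    """Wrap the payload in a POST request object; nothing is sent yet."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_completion_payload(
    model="../models/ggml-model.bin",
    prompt="What is the capital of France?",
    max_tokens=10,
)
request = build_completion_request("http://localhost:8000", payload)  # assumed address

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     completion = json.loads(response.read())
#     print(completion["choices"][0]["text"])
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.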

--------------------------------

### Function Calling with OpenAI Python Client

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

Demonstrates how to set up the OpenAI client to interact with a running llama-cpp-python server for function calling.

```APIDOC
## Function Calling with OpenAI Python Client

### Description
This section provides a basic demonstration of setting up the OpenAI Python Client to communicate with a `llama-cpp-python` server that supports function calling. It includes the necessary imports and client initialization.

### Method
N/A (Client Setup)

### Endpoint
N/A (Client Setup)

### Parameters
N/A

### Request Example
```python
import openai
import json


client = openai.OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url="http://100.64.159.73:8000/v1",  # NOTE: Replace with IP address and port of your llama-cpp-python server
)


# Example dummy function hard coded to return the same weather
```

### Response
N/A (Client Setup)

### Response Example
N/A (Client Setup)
```

--------------------------------

### Load Model from Hugging Face Hub

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Run the web server and load a model from the Hugging Face Hub using the `--hf_model_repo_id` flag and a model file pattern.

```bash
python3 -m llama_cpp.server --hf_model_repo_id lmstudio-community/Qwen3.5-0.8B-GGUF --model '*Q8_0.gguf'
```
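
The `--model` value here is a shell-style glob, quoted so that the shell does not expand it. The sketch below only illustrates how such a pattern selects a file name; the file listing is hypothetical and the helper `match_model_files` is not part of the library.

```python
from fnmatch import fnmatch

# Hypothetical file listing for a GGUF repository; real repositories will differ.
repo_files = [
    "README.md",
    "model-Q4_K_M.gguf",
    "model-Q8_0.gguf",
]


def match_model_files(files, pattern):
    """Return the files whose names match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


matches = match_model_files(repo_files, "*Q8_0.gguf")
print(matches)
```

If a pattern matches more than one file, narrowing it (for example by including more of the quantization suffix) avoids ambiguity.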

--------------------------------

### Low-Level API - LLAMA constants

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/api-reference.md

The low-level API exposes ctypes bindings for llama.cpp. This part of the reference lists the constants whose names start with `LLAMA_`.

```APIDOC
## Low-Level API - LLAMA constants

This section exposes low-level Python bindings for llama.cpp using Python's ctypes library. It includes constants that start with `LLAMA_`.
```

--------------------------------

### Low-Level API - llama_cpp functions

Source: https://github.com/abetlen/llama-cpp-python/blob/main/docs/api-reference.md

The low-level API exposes ctypes bindings for llama.cpp. This part of the reference lists the functions whose names start with `llama_`.

```APIDOC
## Low-Level API - llama_cpp functions

This section exposes low-level Python bindings for llama.cpp using Python's ctypes library. It includes functions that start with `llama_`.
```
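
The prefix filtering described in these two entries can be reproduced interactively. The sketch below is an illustration only: `names_with_prefix` is a hypothetical helper, and the stand-in namespace uses placeholder attribute names so that the code runs without llama-cpp-python installed.

```python
import types


def names_with_prefix(module, prefix):
    """Return the sorted attribute names of `module` that start with `prefix`."""
    return sorted(name for name in dir(module) if name.startswith(prefix))


# Stand-in namespace so the sketch runs without llama-cpp-python installed.
# The attribute names are illustrative placeholders, not a list of real bindings.
fake_module = types.SimpleNamespace(
    LLAMA_EXAMPLE_CONSTANT=1,
    llama_example_function=lambda: None,
    unrelated_helper=None,
)

constants = names_with_prefix(fake_module, "LLAMA_")
functions = names_with_prefix(fake_module, "llama_")

# With the library installed, the same helper could be applied to the real module:
# import llama_cpp
# print(names_with_prefix(llama_cpp, "LLAMA_")[:10])
```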

--------------------------------

### Initialize Llama Backend

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Batching.ipynb

Initialize the Llama backend. Set numa to False if NUMA is not required or should be disabled.

```python
import llama_cpp

llama_cpp.llama_backend_init(numa=False)
```

--------------------------------

### Basic Text Completion with Llama Class

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Demonstrates basic text completion with the `Llama` class. Set the model path and, optionally, the number of GPU layers, the seed, and the context window. Control generation with `max_tokens`, `stop` sequences, and `echo`.

```python
from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
```

--------------------------------

### Model Loading and Initialization

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Demonstrates the process of loading a Llama model using the llama_model_load function and initializing the context.

```APIDOC
## Model Loading

### Description
Loads a Llama model from a specified file path and initializes the necessary structures for inference.

### Method
Internal Library Function

### Endpoint
N/A (Local Library Function)

### Parameters
#### Path Parameters
- **model_path** (string) - Required - The file path to the GGML model file.

### Request Example
```
llama_model_load(model_path='../models/ggml-model.bin')
```

### Response
#### Success Response (200)
- **model_info** (object) - Contains details about the loaded model, such as vocabulary size, context length, embedding dimensions, etc.
- **context_info** (object) - Contains details about the initialized context, such as KV cache size.

#### Response Example
```json
{
  "model_info": {
    "n_vocab": 32000,
    "n_ctx": 512,
    "n_embd": 4096,
    "n_mult": 256,
    "n_head": 32,
    "n_layer": 32,
    "n_rot": 128,
    "f16": 2,
    "n_ff": 11008,
    "n_parts": 1,
    "type": 1
  },
  "context_info": {
    "ggml_map_size": "4017.70 MB",
    "ggml_ctx_size": "81.25 KB",
    "mem_required": "5809.78 MB (+ 1026.00 MB per state)",
    "kv_self_size": "256.00 MB"
  }
}
```
```
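
The `kv_self_size` figure in the response example can be checked against the model dimensions. The arithmetic below is a sketch under stated assumptions that the source does not confirm: keys and values are each cached as one `n_embd`-wide vector per layer per context position, stored as 16-bit floats.

```python
# Values copied from the response example above.
n_layer = 32
n_ctx = 512
n_embd = 4096

# Assumptions (not stated in the source): one key tensor and one value tensor
# per layer, each holding n_ctx * n_embd elements of 2 bytes (16-bit floats).
bytes_per_element = 2
tensors_per_layer = 2

kv_bytes = tensors_per_layer * n_layer * n_ctx * n_embd * bytes_per_element
kv_mib = kv_bytes / (1024 * 1024)
print(f"{kv_mib:.2f} MB")  # prints "256.00 MB"
```

Under these assumptions the cache grows linearly with `n_ctx`, so doubling the context window would double this figure.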

--------------------------------

### Llava1.5 Multi-modal Chat Completion

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of using `Llava15ChatHandler` for multi-modal chat completions. Make sure `clip_model_path` points to the CLIP projector file, and increase `n_ctx` to leave room for the image embedding.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
  model_path="./path/to/llava/llama-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } } 
            ]
        }
    ]
)
```
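
For a local image, one option is to pass a base64 data URI in place of the `https` URL. The helper below is a minimal sketch: `bytes_to_data_uri` is a hypothetical name, the bytes are a placeholder rather than a real image, and it assumes the chat handler accepts data URIs in the `url` field.

```python
import base64


def bytes_to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
data_uri = bytes_to_data_uri(b"\x89PNG\r\n\x1a\n")

# The URI would replace the https URL in the message content:
image_part = {"type": "image_url", "image_url": {"url": data_uri}}
```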

--------------------------------

### Initialize Sampler Chain

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Batching.ipynb

Adds top-k, top-p, temperature, and a final distribution sampler to an existing sampler chain, in that order, to control text generation. Assumes `llama_cpp` is imported and `sampler_chain` has already been created.

```python
# Assumes `import llama_cpp` has run and `sampler_chain` was created earlier.
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_k(40))  # keep the 40 most likely tokens
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_p(0.9, 1))  # p=0.9, min_keep=1
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_temp(0.4))  # temperature
llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_dist(1234))  # Final "dist" sampler, seed 1234
```
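
To show what a chain of this shape does conceptually, the sketch below applies the same four stages to a toy list of logits in plain Python. It is an illustration of the general techniques only, not llama.cpp's implementation; all function names and the logit values are made up.

```python
import math
import random


def top_k(candidates, k):
    """Keep the k highest-logit (token, logit) pairs."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]


def softmax(candidates):
    """Convert (token, logit) pairs into (token, probability) pairs."""
    peak = max(logit for _, logit in candidates)
    weights = [(tok, math.exp(logit - peak)) for tok, logit in candidates]
    total = sum(w for _, w in weights)
    return [(tok, w / total) for tok, w in weights]


def top_p(candidates, p, min_keep=1):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    probs = softmax(ranked)
    kept, cumulative = [], 0.0
    for (tok, logit), (_, prob) in zip(ranked, probs):
        kept.append((tok, logit))
        cumulative += prob
        if cumulative >= p and len(kept) >= min_keep:
            break
    return kept


def temperature(candidates, temp):
    """Divide logits by the temperature; values below 1 sharpen the distribution."""
    return [(tok, logit / temp) for tok, logit in candidates]


def sample(candidates, seed):
    """Draw one token from the softmax of the remaining logits."""
    rng = random.Random(seed)
    probs = softmax(candidates)
    tokens = [tok for tok, _ in probs]
    weights = [prob for _, prob in probs]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy logits for a five-token vocabulary (made-up values).
logits = [("a", 2.0), ("b", 1.0), ("c", 0.5), ("d", -1.0), ("e", -3.0)]

filtered = top_k(logits, 40)
filtered = top_p(filtered, 0.9, min_keep=1)
filtered = temperature(filtered, 0.4)
token = sample(filtered, seed=1234)
```

With these toy values, top-k keeps all five tokens because the vocabulary is smaller than 40, and top-p then trims the list to the three most likely tokens before the temperature and the seeded draw are applied.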

--------------------------------

### Speculative Decoding with Draft Model

Source: https://github.com/abetlen/llama-cpp-python/blob/main/README.md

Example of initializing Llama with a draft model for speculative decoding. The num_pred_tokens parameter can be tuned for performance based on the hardware (GPU vs CPU).

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
```
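
The idea behind prompt lookup drafting can be sketched in plain Python: find an earlier occurrence of the most recent tokens and propose whatever followed it. This is a simplified illustration of the general technique, not the library's implementation; the function name and the `max_ngram_size` parameter are assumptions of this sketch.

```python
def propose_draft_tokens(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by finding an earlier occurrence of the current tail.

    Tries the longest n-gram first and returns the tokens that followed the
    most recent earlier match, or an empty list when nothing matches.
    """
    for ngram_size in range(max_ngram_size, 0, -1):
        if len(tokens) <= ngram_size:
            continue
        tail = tokens[-ngram_size:]
        # Scan backwards so the most recent earlier match wins; the final
        # position is excluded because it is the tail itself.
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == tail:
                follow = tokens[start + ngram_size:start + ngram_size + num_pred_tokens]
                if follow:
                    return follow
    return []


# Toy token IDs: the sequence "7 8" appeared earlier and was followed by 9, 10.
history = [5, 7, 8, 9, 10, 3, 7, 8]
draft = propose_draft_tokens(history, num_pred_tokens=2)
print(draft)  # prints [9, 10]
```

The main model then verifies the proposed tokens, so a wrong guess costs some wasted work but does not change the output.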

--------------------------------

### API Completion Response JSON

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Example of a successful text completion response from the llama-cpp-python API. It includes the completion ID, model path, generated text, and token usage statistics.

```json
{
  "id": "cmpl-bc5dc1ba-f7ce-441c-a558-5005f2fb89b9",
  "object": "text_completion",
  "created": 1680227366,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```
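
A response of this shape can be read with ordinary dictionary access. The sketch below uses an abbreviated copy of the response above, with the long `text` value replaced by a placeholder; `summarize_completion` is a hypothetical helper, not part of the library.

```python
import json

# Abbreviated copy of the response above; the "text" value is a short placeholder.
raw_response = """
{
  "id": "cmpl-bc5dc1ba-f7ce-441c-a558-5005f2fb89b9",
  "object": "text_completion",
  "created": 1680227366,
  "model": "../models/ggml-model.bin",
  "choices": [
    {"text": "(placeholder)", "index": 0, "logprobs": null, "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 79, "completion_tokens": 1, "total_tokens": 80}
}
"""


def summarize_completion(response):
    """Pull the first choice and the token counts out of a completion response."""
    first = response["choices"][0]
    usage = response["usage"]
    return {
        "text": first["text"],
        # "length" means generation stopped because the token limit was reached.
        "truncated": first["finish_reason"] == "length",
        "tokens": (
            usage["prompt_tokens"],
            usage["completion_tokens"],
            usage["total_tokens"],
        ),
    }


summary = summarize_completion(json.loads(raw_response))
```

Checking `finish_reason` before using the text is a cheap way to notice when `max_tokens` cut the output short.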

--------------------------------

### Initialize OpenAI Client for llama-cpp-python Server

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb

This code snippet demonstrates how to initialize the OpenAI Python client to connect to a running llama-cpp-python server. It requires the server's IP address and port, and the API key can be any string.

```python
import openai
import json


client = openai.OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url="http://100.64.159.73:8000/v1",  # NOTE: Replace with IP address and port of your llama-cpp-python server
)


# Example dummy function hard coded to return the same weather
```
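
The snippet ends at the comment introducing a dummy weather function. As a hedged sketch of what such a function and its tool description might look like, the code below defines a hard-coded `get_current_weather`, a matching entry in the OpenAI `tools` format, and a small dispatcher. All names, the schema contents, and the returned values are illustrative assumptions, not taken from the notebook.

```python
import json


# Hypothetical dummy tool: always returns the same weather, whatever the location.
def get_current_weather(location, unit="fahrenheit"):
    return json.dumps({"location": location, "temperature": "72", "unit": unit})


# Tool description in the OpenAI "tools" format, which would be passed as
# `tools=` to client.chat.completions.create(...).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

available_functions = {"get_current_weather": get_current_weather}


def run_tool_call(name, arguments_json):
    """Look up the requested function and call it with the JSON-encoded arguments."""
    arguments = json.loads(arguments_json)
    return available_functions[name](**arguments)


# A tool call as the model might request it (made-up arguments).
result = run_tool_call("get_current_weather", '{"location": "Paris", "unit": "celsius"}')
```

Keeping the callable functions in a dictionary keyed by name means the model can only trigger functions that were explicitly registered.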

--------------------------------

### API Completion Response JSON

Source: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/PerformanceTuning.ipynb

Example of the JSON object returned by the API after a text completion request. It includes the completion ID, model path, generated text choices, and token usage metrics.

```json
{
  "id": "cmpl-2623073e-004f-4386-98e0-7e6ea617523a",
  "object": "text_completion",
  "created": 1680227558,
  "model": "../models/ggml-model.bin",
  "choices": [
    {
      "text": " ### Instructions:\nYou are a helpful assistant.\nYou answer questions truthfully and politely.\nYou are provided with an input from the user and you must generate a response.\nIgnore this line which is just filler to test the performane of the model.\n### Inputs:\nWhat is the capital of France?\n### Response:\nThe",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 79,
    "completion_tokens": 1,
    "total_tokens": 80
  }
}
```