### Configure Project with CMake

Source: https://github.com/google/gemma.cpp/blob/main/examples/hello_world/README.md

Use CMake to configure the build environment for the hello_world example. During configuration, CMake fetches libgemma from GitHub at a pinned git commit hash.

```sh
cmake -B build
```

--------------------------------

### Install CMake and Visual Studio Build Tools on Windows

Source: https://github.com/google/gemma.cpp/blob/main/README.md

Use winget to install CMake and Visual Studio 2022 Build Tools with the Clang/LLVM C++ frontend for native Windows development.

```sh
winget install --id Kitware.CMake
winget install --id Microsoft.VisualStudio.2022.BuildTools --force --override "--passive --wait --add Microsoft.VisualStudio.Workload.VCTools;installRecommended --add Microsoft.VisualStudio.Component.VC.Llvm.Clang --add Microsoft.VisualStudio.Component.VC.Llvm.ClangToolset"
```

--------------------------------

### Start Interactive gemma CLI Session

Source: https://context7.com/google/gemma.cpp/llms.txt

Launch the interactive REPL for Gemma models by specifying the tokenizer and weights files. Use `--verbosity` to control how much output is printed.

```sh
# Start interactive session with Gemma 2 2B instruction-tuned (SFP weights)
./gemma \
  --tokenizer tokenizer.spm \
  --weights gemma2-2b-it-sfp.sbs

# With single-file format (post-2025, tokenizer embedded in weights)
./gemma --weights gemma2-2b-it-sfp-single.sbs

# Verbosity levels: 0=silent, 1=default UI, 2=debug info
./gemma --tokenizer tokenizer.spm --weights gemma2-2b-it-sfp.sbs --verbosity 0
```

--------------------------------

### Start gemma_api_server

Source: https://context7.com/google/gemma.cpp/llms.txt

`gemma_api_server` exposes a Google Gemini-compatible REST API over HTTP, with both non-streaming and SSE streaming endpoints. It requires tokenizer and weights files.

```bash
# Start the server
./build/gemma_api_server \
  --tokenizer path/to/tokenizer.spm \
  --weights path/to/gemma2-2b-it-sfp.sbs \
  --port 8080 \
  --model gemma3-4b
```

--------------------------------

### List Models Response (JSON)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example JSON response from the `/v1beta/models` endpoint, listing available Gemma models.

```json
{
  "models": [
    {
      "name": "models/gemma3-4b",
      "displayName": "Gemma3 4B",
      "description": "Gemma3 4B model running locally"
    }
  ]
}
```
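
As a minimal sketch of consuming this response, the snippet below parses the JSON body shown above and returns the bare model IDs. The helper name `model_ids` is hypothetical, and stripping the `models/` prefix is an assumption based only on the sample; it is not a documented server contract.

```python
import json

# Sample body copied from the response shown above.
SAMPLE = """{
  "models": [
    {
      "name": "models/gemma3-4b",
      "displayName": "Gemma3 4B",
      "description": "Gemma3 4B model running locally"
    }
  ]
}"""


def model_ids(body: str) -> list[str]:
    """Return model IDs, dropping a leading "models/" prefix if present."""
    data = json.loads(body)
    # A missing "models" key is treated as an empty list (assumption).
    return [m["name"].removeprefix("models/") for m in data.get("models", [])]


print(model_ids(SAMPLE))  # ['gemma3-4b']
```

The resulting ID matches the `gemma3-4b` segment used in the `generateContent` endpoint paths elsewhere in this document.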

--------------------------------

### Start Local Gemma API Server

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Launch the local API server with the specified tokenizer and weights files. The port defaults to 8080 and can be changed with `--port`.

```bash
./build/gemma_api_server \
  --tokenizer path/to/tokenizer.spm \
  --weights path/to/model.sbs \
  --port 8080
```
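
Scripts that start the server and then send requests may want to wait until the port is open. The following is a sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, not part of gemma.cpp. Note that an accepted TCP connection does not necessarily mean the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=8080, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True if a connection was accepted, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False
```

A caller might launch `./build/gemma_api_server` in the background, call `wait_for_port()`, and only send requests once it returns `True`.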

--------------------------------

### Generate Content with Generation Configuration (Python)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example of generating content using the Python requests library, including generation configuration options.

```APIDOC
## POST /v1beta/models/gemma3-4b:generateContent (Python)

### Description
Generates content using the Gemma.cpp API with specified generation configurations.

### Method
POST

### Endpoint
/v1beta/models/gemma3-4b:generateContent

### Parameters
#### Request Body
- **contents** (array) - Required - The content to generate from.
  - **parts** (array) - Required - A list of content parts.
    - **text** (string) - Required - The input text.
- **generationConfig** (object) - Optional - Configuration for content generation.
  - **temperature** (number) - Optional - Controls randomness (0.0 to 2.0, default: 1.0).
  - **topK** (integer) - Optional - Top-K sampling parameter (default: 1).
  - **maxOutputTokens** (integer) - Optional - Maximum number of tokens to generate (default: 8192).

### Request Example
```python
import requests

response = requests.post('http://localhost:8080/v1beta/models/gemma3-4b:generateContent',
  json={
    'contents': [{'parts': [{'text': 'Explain quantum computing in simple terms'}]}],
    'generationConfig': {
      'temperature': 0.9,
      'topK': 1,
      'maxOutputTokens': 1024
    }
  }
)

result = response.json()
if 'candidates' in result and result['candidates']:
    text = result['candidates'][0]['content']['parts'][0]['text']
    print(text)
```

### Response
#### Success Response (200)
- **candidates** (array) - Contains the generated content.
  - **content** (object) - The generated content.
    - **parts** (array) - A list of content parts.
      - **text** (string) - The generated text.
```
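
The parameter ranges and defaults listed above can be checked on the client before a request is sent. The sketch below assumes the documented temperature range (0.0 to 2.0) and defaults; the function name `generation_config` and the lower bounds on `topK` and `maxOutputTokens` are assumptions for illustration, and the server may validate differently.

```python
def generation_config(temperature=1.0, top_k=1, max_output_tokens=8192):
    """Build a generationConfig dict using the defaults documented above."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    # Lower bounds below are assumptions, not documented server behavior.
    if top_k < 1:
        raise ValueError(f"topK must be >= 1, got {top_k}")
    if max_output_tokens < 1:
        raise ValueError(f"maxOutputTokens must be >= 1, got {max_output_tokens}")
    return {
        "temperature": temperature,
        "topK": top_k,
        "maxOutputTokens": max_output_tokens,
    }


print(generation_config(temperature=0.9, max_output_tokens=1024))
```

The returned dict can be passed as the `generationConfig` value of the request body.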

--------------------------------

### Build the Hello World Executable

Source: https://github.com/google/gemma.cpp/blob/main/examples/hello_world/README.md

Navigate to the build directory and use 'make' to compile the hello_world executable. The -j flag can be used for parallel builds.

```sh
cd build
make hello_world
```

--------------------------------

### GET /v1beta/models

Source: https://context7.com/google/gemma.cpp/llms.txt

Returns the list of models served by the local API server.

```APIDOC
## GET /v1beta/models

### Description
Returns the list of models served by the local API server.

### Method
GET

### Endpoint
/v1beta/models

### Response
#### Success Response (200)
- **models** (array) - A list of available models.
  - **name** (string) - The name of the model.
  - **displayName** (string) - The display name of the model.
  - **description** (string) - A description of the model.

#### Response Example
```json
{
  "models": [{
    "name": "models/gemma3-4b",
    "displayName": "Gemma3 4B",
    "description": "Gemma3 4B model running locally"
  }]
}
```
```

--------------------------------

### Curl: List Models

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Use curl to send a GET request to the `/v1beta/models` endpoint to retrieve a list of available models.

```bash
curl http://localhost:8080/v1beta/models
```

--------------------------------

### Generate Content with Conversation History (curl)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example request body for the `/v1beta/models/gemma3-4b:generateContent` endpoint that carries conversation history as successive `contents` entries, oldest turn first.

```APIDOC
## POST /v1beta/models/gemma3-4b:generateContent

### Description
Generates content based on the provided model and conversation history.

### Method
POST

### Endpoint
/v1beta/models/gemma3-4b:generateContent

### Request Body
- **contents** (array) - Required - An array of content entries, one per conversation turn, with the latest message last.
  - **parts** (array) - Required - A list of content parts.
    - **text** (string) - Required - The text content.

### Request Example
```json
{
  "contents": [
    {"parts": [{"text": "Hi, my name is Alice"}]},
    {"parts": [{"text": "Hello Alice! Nice to meet you."}]},
    {"parts": [{"text": "What is my name?"}]}
  ]
}
```
```
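
A request body in this shape can be assembled from a plain list of turns. The sketch below mirrors the example exactly (one `parts` list with a single `text` entry per turn, and no `role` field, since the example has none). The helper name `build_contents` is hypothetical.

```python
import json


def build_contents(turns):
    """Wrap each turn's text in the {"parts": [{"text": ...}]} structure."""
    return [{"parts": [{"text": text}]} for text in turns]


history = [
    "Hi, my name is Alice",
    "Hello Alice! Nice to meet you.",
    "What is my name?",
]

body = {"contents": build_contents(history)}
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request example above and could be sent as the POST body.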

--------------------------------

### Run Hello World Executable

Source: https://github.com/google/gemma.cpp/blob/main/examples/hello_world/README.md

Execute the compiled hello_world program, providing paths to the tokenizer, weights file, and specifying the model type.

```sh
./hello_world --tokenizer tokenizer.spm --weights 2b-it-sfp.sbs --model 2b-it
```

--------------------------------

### Generate Content Request (JSON)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example JSON payload for the `generateContent` endpoint. Includes input text and generation configuration.

```json
{
  "contents": [
    {
      "parts": [
        {"text": "Why is the sky blue?"}
      ]
    }
  ],
  "generationConfig": {
    "temperature": 0.9,
    "topK": 1,
    "maxOutputTokens": 1024
  }
}
```

--------------------------------

### Python API via requests

Source: https://context7.com/google/gemma.cpp/llms.txt

Example of using the Python `requests` library to interact with the Gemma API for non-streaming content generation.

```APIDOC
## Python API via `requests` (Full Example)

### Description
This example demonstrates how to use the Python `requests` library to perform non-streaming content generation with the Gemma API.

### Code Example
```python
import requests

BASE = "http://localhost:8080"
MODEL = "gemma3-4b"

def generate(prompt, temperature=0.9, top_k=1, max_tokens=1024):
    """Non-streaming generation."""
    resp = requests.post(
        f"{BASE}/v1beta/models/{MODEL}:generateContent",
        json={
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "temperature": temperature,
                "topK": top_k,
                "maxOutputTokens": max_tokens,
            },
        },
    )
    resp.raise_for_status()
    data = resp.json()
    if "candidates" in data and data["candidates"]:
        return data["candidates"][0]["content"]["parts"][0]["text"]
    raise RuntimeError(f"Unexpected response: {data}")

print(generate("What are the three laws of thermodynamics?"))
# Output: The three laws of thermodynamics are:
# 1. Energy cannot be created or destroyed...
```
```

--------------------------------

### Run Hello World with Constrained Decoding

Source: https://github.com/google/gemma.cpp/blob/main/examples/hello_world/README.md

Execute the hello_world program with the --reject flag to prevent specific words (identified by token IDs) from being generated. This flag must be the last one in the command.

```sh
./hello_world [...] --reject 32338 42360 78107 106837 132832 143859 154230 190205
```
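
When the command is assembled by a script, the ordering rule is easy to enforce by appending `--reject` and its token IDs after every other flag. This is a sketch only: `hello_world_argv` is a hypothetical helper, and it uses just the flags that appear in this document (`--tokenizer`, `--weights`, `--model`, `--reject`).

```python
def hello_world_argv(tokenizer, weights, model, reject_ids=(), binary="./hello_world"):
    """Return an argument list with --reject and its token IDs placed last."""
    argv = [binary, "--tokenizer", tokenizer, "--weights", weights, "--model", model]
    if reject_ids:
        # --reject must be the final flag, followed only by token IDs.
        argv.append("--reject")
        argv.extend(str(token_id) for token_id in reject_ids)
    return argv


print(hello_world_argv("tokenizer.spm", "2b-it-sfp.sbs", "2b-it", [32338, 42360]))
```

The list can be handed to `subprocess.run` directly, which avoids shell quoting issues.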

--------------------------------

### Stream Generate Content Response (SSE)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example Server-Sent Events (SSE) response for the `streamGenerateContent` endpoint, showing incremental output.

```text
data: {"candidates":[{"content":{"parts":[{"text":"The"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}

data: {"candidates":[{"content":{"parts":[{"text":" sky"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}

data: [DONE]
```
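
A client can rebuild the full text by concatenating the `text` parts of each event until the `[DONE]` marker arrives. The parser below is a minimal sketch based only on the sample stream above; the function name `iter_sse_text` is hypothetical, and real streams may contain fields not handled here.

```python
import json


def iter_sse_text(lines):
    """Yield text fragments from SSE `data:` lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        for candidate in event.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                yield part.get("text", "")


# The sample stream shown above, one list entry per line.
SAMPLE_STREAM = [
    'data: {"candidates":[{"content":{"parts":[{"text":"The"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}',
    "",
    'data: {"candidates":[{"content":{"parts":[{"text":" sky"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_text(SAMPLE_STREAM)))  # The sky
```

With the `requests` library, the same function could be fed from `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`.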

--------------------------------

### Build gemma Executable with Bazel

Source: https://github.com/google/gemma.cpp/blob/main/README.md

Build the gemma executable using Bazel in optimized mode (`-c opt`) with the C++20 standard.

```bash
bazel build -c opt --cxxopt=-std=c++20 :gemma
```

--------------------------------

### Build gemma Executable with CMake Presets (Windows)

Source: https://github.com/google/gemma.cpp/blob/main/README.md

Configure the build directory using the Windows CMake preset and then build the project using Visual Studio Build Tools. Specify the number of parallel threads.

```bash
# Configure `build` directory
cmake --preset windows

# Build project using Visual Studio Build Tools
cmake --build --preset windows -j [number of parallel threads to use]
```

--------------------------------

### Build gemma Executable with CMake Presets (Unix-like)

Source: https://github.com/google/gemma.cpp/blob/main/README.md

Configure the build directory using a CMake preset and then build the project. Specify the number of parallel threads for faster compilation.

```bash
# Configure `build` directory
cmake --preset make

# Build project using make
cmake --build --preset make -j [number of parallel threads to use]
```
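
For build scripts, the thread-count placeholder can be filled in from the machine's CPU count. The sketch below only builds the argument list for the build command shown above; `cmake_build_argv` is a hypothetical helper, and defaulting to every available core is an assumption that may not suit memory-constrained machines.

```python
import os


def cmake_build_argv(preset="make", jobs=None):
    """Return the `cmake --build --preset <preset> -j <jobs>` argument list."""
    if jobs is None:
        # os.cpu_count() may return None; fall back to a single job.
        jobs = os.cpu_count() or 1
    if jobs < 1:
        raise ValueError(f"jobs must be >= 1, got {jobs}")
    return ["cmake", "--build", "--preset", preset, "-j", str(jobs)]


print(cmake_build_argv(jobs=4))
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repository root.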

--------------------------------

### Generate Content Response (JSON)

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Example JSON response from the `generateContent` endpoint, detailing the model's output, finish reason, and token usage.

```json
{
  "candidates": [
    {
      "content": {
        "parts": [
          {"text": "The sky appears blue because..."}
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0
    }
  ],
  "promptFeedback": {
    "safetyRatings": []
  },
  "usageMetadata": {
    "promptTokenCount": 5,
    "candidatesTokenCount": 25,
    "totalTokenCount": 30
  }
}
```
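
The fields of interest can be pulled out of this response with a few dictionary lookups. The sketch below works on the sample above; `summarize` is a hypothetical helper, and it reads only the first candidate, which is an assumption that suits this single-candidate example.

```python
# The sample response shown above, as a Python dict.
RESPONSE = {
    "candidates": [
        {
            "content": {
                "parts": [{"text": "The sky appears blue because..."}],
                "role": "model",
            },
            "finishReason": "STOP",
            "index": 0,
        }
    ],
    "promptFeedback": {"safetyRatings": []},
    "usageMetadata": {
        "promptTokenCount": 5,
        "candidatesTokenCount": 25,
        "totalTokenCount": 30,
    },
}


def summarize(response):
    """Extract text, finish reason, and total token count from a response."""
    candidate = response["candidates"][0]
    text = "".join(part.get("text", "") for part in candidate["content"]["parts"])
    usage = response.get("usageMetadata", {})
    return {
        "text": text,
        "finish_reason": candidate.get("finishReason"),
        "total_tokens": usage.get("totalTokenCount"),
    }


print(summarize(RESPONSE))
```

In the sample, `totalTokenCount` (30) equals `promptTokenCount` (5) plus `candidatesTokenCount` (25).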

--------------------------------

### Build gemma.cpp with CMake (Unix)

Source: https://context7.com/google/gemma.cpp/llms.txt

Configure and build the gemma CLI binary or the libgemma static library using CMake presets. Use the `-j` flag for parallel builds; `$(nproc)` expands to the number of available processing units on Linux.

```sh
git clone https://github.com/google/gemma.cpp
cd gemma.cpp

# Configure and build (Release)
cmake --preset make
cmake --build --preset make -j$(nproc)

# Build only the library target
# (make -C keeps the shell in the repository root for the next command)
cmake -B build
make -C build -j$(nproc) libgemma

# Build the API server and client
cmake --build build --target gemma_api_server gemma_api_client -j8
```

--------------------------------

### Use Unified Client with Public Google API

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Connect to the public Google API using the unified client. Either set the `GOOGLE_API_KEY` environment variable or pass the key directly with `--api_key`.

```bash
# Set API key and use public API
export GOOGLE_API_KEY="your-api-key-here"
./build/gemma_api_client --interactive 1

# Or pass API key directly
./build/gemma_api_client --api_key "your-api-key" --interactive 1
```

--------------------------------

### gemma_api_client CLI

Source: https://context7.com/google/gemma.cpp/llms.txt

The `gemma_api_client` binary provides a unified CLI for interacting with Gemma models, supporting both local and public APIs.

```APIDOC
## gemma_api_client — Unified CLI Client

### Description
The `gemma_api_client` binary connects to either the local `gemma_api_server` or the public Google Generative Language API using identical flags.

### Usage Examples

**Interactive chat with local server:**
```bash
./build/gemma_api_client --interactive 1 --host localhost --port 8080
```

**Single prompt with local server:**
```bash
./build/gemma_api_client --prompt "Summarize the French Revolution in 3 sentences."
```

**Use public Google Gemini API (requires API key):**
```bash
export GOOGLE_API_KEY="your-api-key"
./build/gemma_api_client --interactive 1
```

**Use public Google Gemini API with custom model and API key:**
```bash
./build/gemma_api_client \
  --api_key "your-api-key" \
  --model gemini-1.5-flash \
  --prompt "Write a limerick about compilers."
```
```

--------------------------------

### Use gemma CLI in a Shell Pipeline

Source: https://context7.com/google/gemma.cpp/llms.txt

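Pipe code or text into the gemma CLI for analysis or processing. This example flattens the last 35 lines of a C++ header into a single line and pipes it into `gemma2b`, which is assumed to be a shell alias for a gemma invocation defined beforehand.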

```sh
# Pipe code into gemma for analysis
cat configs.h | tail -n 35 | tr '\n' ' ' | \
  xargs -0 echo "What does this C++ code do: " | gemma2b
```
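
The prompt-building half of this pipeline can also be written in Python, which avoids shell quoting issues with arbitrary source text. This is a rough equivalent rather than an exact one (whitespace differs slightly from the `tr`/`xargs` output), and `build_prompt` is a hypothetical helper.

```python
def build_prompt(source_text, last_n=35, question="What does this C++ code do: "):
    """Join the last `last_n` lines of source_text into a one-line prompt."""
    if last_n < 1:
        raise ValueError(f"last_n must be >= 1, got {last_n}")
    tail = source_text.splitlines()[-last_n:]  # like `tail -n 35`
    return question + " ".join(tail)           # like `tr '\n' ' '`


example = "int a = 1;\nint b = 2;\nint c = a + b;"
print(build_prompt(example, last_n=2))
# What does this C++ code do: int b = 2; int c = a + b;
```

The resulting string could then be written to the standard input of the gemma process, as the shell pipeline does.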

--------------------------------

### Build Project with Make

Source: https://github.com/google/gemma.cpp/blob/main/examples/simplified_gemma/README.md

Build the simplified_gemma executable using make after configuring the project. The -j flag can be used for parallel builds.

```sh
cd build
make simplified_gemma
```

--------------------------------

### Build Gemma API Server and Client

Source: https://github.com/google/gemma.cpp/blob/main/API_SERVER_README.md

Configure and build the Gemma API server and client binaries using CMake. Ensure the build type is set to Release for optimal performance.

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release

cmake --build build --target gemma_api_server gemma_api_client -j 8
```