### Basic Llama Model Inference Example in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Demonstrates a complete conversational loop with `LlamaModel`. It configures the model path and the number of layers offloaded to the GPU, then repeatedly reads user input, streams the model's reply, and appends both to the prompt so that context carries across turns. The model is opened with try-with-resources so that it is cleaned up automatically.

```java
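import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Example {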

    public static void main(String... args) throws IOException {
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43);

        String system = "This is a conversation between User and Llama, a friendly chatbot.\n" +
                "Llama is helpful, kind, honest, good at writing, and never fails to answer any " +
                "requests immediately and with precision.\n";
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try (LlamaModel model = new LlamaModel(modelParams)) {
            System.out.print(system);
            String prompt = system;
            while (true) {
                prompt += "\nUser: ";
                System.out.print("\nUser: ");
                String input = reader.readLine();
                if (input == null) break; // end of input: leave the loop instead of appending "null"
                prompt += input;
                System.out.print("Llama: ");
                prompt += "\nLlama: ";
                InferenceParameters inferParams = new InferenceParameters(prompt)
                        .setTemperature(0.7f)
                        .setPenalizeNl(true)
                        .setMiroStat(MiroStat.V2)
                        .setStopStrings("User:");
                for (LlamaOutput output : model.generate(inferParams)) {
                    System.out.print(output);
                    prompt += output;
                }
            }
        }
    }
}
```
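
Because every turn is appended to the prompt, the prompt grows without bound and will eventually exceed the model's context window. The following is a minimal sketch of one way to bound it, assuming a character budget as a rough stand-in for a token budget. `PromptWindow` and its methods are hypothetical helpers for illustration and are not part of java-llama.cpp.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of java-llama.cpp): keeps the system prompt plus the most recent turns. */
final class PromptWindow {
    private final String system;
    private final int maxChars;
    private final Deque<String> turns = new ArrayDeque<>();
    private int turnChars = 0;

    PromptWindow(String system, int maxChars) {
        this.system = system;
        this.maxChars = maxChars;
    }

    /** Adds a turn, then evicts the oldest turns until the budget is met (the newest turn is always kept). */
    void add(String turn) {
        turns.addLast(turn);
        turnChars += turn.length();
        while (turns.size() > 1 && system.length() + turnChars > maxChars) {
            turnChars -= turns.removeFirst().length();
        }
    }

    /** Renders the system prompt followed by the retained turns, oldest first. */
    String render() {
        StringBuilder sb = new StringBuilder(system);
        for (String turn : turns) {
            sb.append(turn);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PromptWindow window = new PromptWindow("SYS\n", 30);
        window.add("\nUser: hi");
        window.add("\nLlama: hello");
        window.add("\nUser: how are you?");
        System.out.println(window.render());
    }
}
```

In the chat loop, `render()` would supply the string passed to `InferenceParameters` in place of the ever-growing `prompt`. A token-based budget would be more precise than a character count.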

--------------------------------

### Configuring Llama Model and Inference Parameters in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Illustrates how to configure `ModelParameters` and `InferenceParameters` using their builder patterns. This includes specifying the model path, adding LoRA adapters, setting grammar rules for controlled generation, and adjusting the temperature for creativity. The example uses try-with-resources for safe model loading and unloading.

```java
ModelParameters modelParams = new ModelParameters()
        .setModel("/path/to/model.gguf")
        .addLoraAdapter("/path/to/lora/adapter");
String grammar = """
		root  ::= (expr "=" term "\n")+
		expr  ::= term ([-+*/] term)*
		term  ::= [0-9]""";
InferenceParameters inferParams = new InferenceParameters("")
        .setGrammar(grammar)
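        .setTemperature(0.8f);
try (LlamaModel model = new LlamaModel(modelParams)) {
    // Iterate over the result to stream the generated text
    for (LlamaOutput output : model.generate(inferParams)) {
        System.out.print(output);
    }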
}
```

--------------------------------

### Stream Text Generation in Java using generate()

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates token-by-token text generation with the `generate()` method, which returns a `LlamaIterable`. Streaming allows real-time output display and mid-stream cancellation. The example sets up a basic chat loop, reads user input, and streams the model's response. It uses the `de.kherud.llama` and `de.kherud.llama.args` packages along with standard Java I/O.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamingChat {
    public static void main(String[] args) throws IOException {
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43);

        String systemPrompt = "This is a conversation between User and Assistant.\n" +
                "Assistant is helpful, kind, and honest.\n\n" +
                "User: Hello!\nAssistant: Hello! How can I help you today?";

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));

        try (LlamaModel model = new LlamaModel(modelParams)) {
            String context = systemPrompt;

            while (true) {
                System.out.print("\nUser: ");
                String userInput = reader.readLine();
                if (userInput == null || userInput.equalsIgnoreCase("quit")) break;

                context += "\nUser: " + userInput + "\nAssistant: ";

                InferenceParameters inferParams = new InferenceParameters(context)
                        .setTemperature(0.7f)
                        .setPenalizeNl(true)
                        .setMiroStat(MiroStat.V2)
                        .setStopStrings("User:", "\n\n");

                System.out.print("Assistant: ");
                StringBuilder response = new StringBuilder();

                for (LlamaOutput output : model.generate(inferParams)) {
                    System.out.print(output);
                    response.append(output);
                }
                System.out.println();

                context += response.toString();
            }
        }
    }
}

```

--------------------------------

### Maven Dependency Configuration for java-llama.cpp

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Provides Maven dependency configurations for including the java-llama.cpp library in a Java project. It shows the basic CPU-only dependency and an example for enabling CUDA GPU support on Linux x86-64 systems.

```xml
<!-- Basic CPU-only dependency -->
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.2.0</version>
</dependency>

<!-- For CUDA GPU support on Linux x86-64, add the classifier -->
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.2.0</version>
    <classifier>cuda12-linux-x86-64</classifier>
</dependency>
```

--------------------------------

### Configure JNI Include Directories (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

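Sets the JNI include directories based on the detected OS when `JNI_INCLUDE_DIRS` is not already defined. Linux and macOS use the headers bundled under `.github/include/unix`, and Windows uses those under `.github/include/windows`. On any other OS, it locates `jni.h` and `jni_md.h` through the Java installation, using `JAVA_HOME` as a hint. Configuration stops with a fatal error if no include directory can be determined.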

```cmake
if(NOT DEFINED JNI_INCLUDE_DIRS)
    if(OS_NAME MATCHES "^Linux" OR OS_NAME STREQUAL "Mac" OR OS_NAME STREQUAL "Darwin")
        set(JNI_INCLUDE_DIRS .github/include/unix)
    elseif(OS_NAME STREQUAL "Windows")
        set(JNI_INCLUDE_DIRS .github/include/windows)
    else()
        find_package(Java REQUIRED)
        find_program(JAVA_EXECUTABLE NAMES java)

        find_path(JNI_INCLUDE_DIRS NAMES jni.h HINTS ENV JAVA_HOME PATH_SUFFIXES include)

        file(GLOB_RECURSE JNI_MD_PATHS RELATIVE "${JNI_INCLUDE_DIRS}" "${JNI_INCLUDE_DIRS}/**/jni_md.h")
        foreach(PATH IN LISTS JNI_MD_PATHS)
            get_filename_component(DIR ${PATH} DIRECTORY)
            list(APPEND JNI_INCLUDE_DIRS "${JNI_INCLUDE_DIRS}/${DIR}")
        endforeach()
    endif()
endif()
if(NOT JNI_INCLUDE_DIRS)
    message(FATAL_ERROR "Could not determine JNI include directories")
endif()
```

--------------------------------

### Configure Logging in Java with Llama.cpp

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates how to configure logging for the java-llama.cpp library with `LlamaModel.setLogger`. Logs can be redirected to a custom handler, emitted in `TEXT` or `JSON` format, written to stdout by passing a `null` callback, or disabled with a no-op callback. Log timestamps, prefixes, and verbosity are set on `ModelParameters`.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.LogFormat;

// Redirect logs to custom handler (e.g., logging framework)
LlamaModel.setLogger(LogFormat.TEXT, (level, message) -> {
    switch (level) {
        case ERROR:
            System.err.println("[ERROR] " + message);
            break;
        case WARN:
            System.err.println("[WARN] " + message);
            break;
        case INFO:
            System.out.println("[INFO] " + message);
            break;
        case DEBUG:
            System.out.println("[DEBUG] " + message);
            break;
    }
});

// Use JSON format for structured logging
LlamaModel.setLogger(LogFormat.JSON, (level, message) -> {
    // message is already JSON formatted
    System.out.println(message);
});

// Log to stdout with different format (pass null callback)
LlamaModel.setLogger(LogFormat.TEXT, null);

// Disable logging completely
LlamaModel.setLogger(null, (level, message) -> {});

// Now load and use model
ModelParameters params = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .enableLogTimestamps()
        .enableLogPrefix()
        .setLogVerbosity(1);  // 0 = minimal, higher = more verbose
```
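
The redirect callback can also hand messages to an existing logging framework. The following is a minimal sketch using `java.util.logging`, assuming the level names `ERROR`, `WARN`, `INFO`, and `DEBUG` from the switch statement in the snippet. `JulBridge` and its methods are hypothetical and not part of java-llama.cpp.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical bridge (not part of java-llama.cpp): forwards log messages to java.util.logging. */
final class JulBridge {
    private static final Logger LOG = Logger.getLogger("llama");

    /** Maps ERROR, WARN, and DEBUG to JUL levels; any other name falls back to INFO. */
    static Level toJulLevel(String levelName) {
        switch (levelName) {
            case "ERROR":
                return Level.SEVERE;
            case "WARN":
                return Level.WARNING;
            case "DEBUG":
                return Level.FINE;
            default:
                return Level.INFO;
        }
    }

    /** Forwards one message; intended to be called from the logger callback. */
    static void forward(String levelName, String message) {
        // Trim trailing newlines so each message becomes a single JUL record
        LOG.log(toJulLevel(levelName), message.trim());
    }

    public static void main(String[] args) {
        forward("WARN", "example warning\n");
    }
}
```

Assuming the callback's `level` argument is an enum, as the `case ERROR:` labels suggest, it could be wired in as `LlamaModel.setLogger(LogFormat.TEXT, (level, message) -> JulBridge.forward(level.name(), message));`.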

--------------------------------

### Customizing Llama Model Logging in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Shows how to intercept and customize log messages from `LlamaModel`. A custom callback redirects messages, here to `System.out`. Passing `null` as the callback keeps logging on stdout while changing the format, and passing a no-op callback disables logging entirely.

```java
// Re-direct log messages however you like (e.g. to a logging library)
LlamaModel.setLogger(LogFormat.TEXT, (level, message) -> System.out.println(level.name() + ": " + message));
// Log to stdout, but change the format
LlamaModel.setLogger(LogFormat.TEXT, null);
// Disable logging by passing a no-op
LlamaModel.setLogger(null, (level, message) -> {});
```

--------------------------------

### Convert JSON Schema to GBNF Grammar in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet shows how to convert a JSON schema into a GBNF grammar, enabling type-safe structured JSON output from the language model. It uses the static `jsonSchemaToGrammar` method from the `LlamaModel` class. The generated grammar can then be used with `InferenceParameters` to guide JSON generation. Dependencies include the de.kherud.llama library.

```java
import de.kherud.llama.*;

String jsonSchema = """
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "email": {"type": "string"}
            },
            "required": ["name", "age"],
            "additionalProperties": false
        }
        """;

// Convert JSON schema to grammar
String grammar = LlamaModel.jsonSchemaToGrammar(jsonSchema);
System.out.println("Generated Grammar:\n" + grammar);

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters(
            "Generate a JSON object for a person named John who is 30 years old:")
            .setGrammar(grammar)
            .setNPredict(100);

    String jsonOutput = model.complete(params);
    System.out.println("JSON output: " + jsonOutput);
    // Example output (depends on the model): {"name": "John", "age": 30}
}

```

--------------------------------

### Access Token Probabilities in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to access token probabilities during text generation in Java using the java-llama.cpp library. This allows for uncertainty estimation and analysis of the generated output. It requires the llama library and a pre-trained model.

```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters("The capital of France is")
            .setNProbs(5)        // Return top 5 token probabilities
            .setTemperature(0.0f) // Deterministic for reproducibility
            .setNPredict(10);

    System.out.println("Generation with probabilities:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output.text);

        // Access probabilities for this token
        if (!output.probabilities.isEmpty()) {
            System.out.println("\n  Top tokens:");
            output.probabilities.entrySet().stream()
                    .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
                    .limit(3)
                    .forEach(e -> System.out.printf("    '%s': %.4f%n",
                            e.getKey(), e.getValue()));
        }
    }
}
```
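
One simple uncertainty signal is the entropy of the returned probabilities. The following sketch computes Shannon entropy over a `Map<String, Float>` like the one read from `output.probabilities` above. `TokenUncertainty` is a hypothetical helper, and because only the top-N tokens are available, the probabilities are renormalized, so the result is an approximation of the full-vocabulary entropy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of java-llama.cpp): entropy of a top-N probability map. */
final class TokenUncertainty {

    /** Shannon entropy in bits over the renormalized probabilities; 0.0 for an empty or all-zero map. */
    static double entropyBits(Map<String, Float> probs) {
        double total = 0.0;
        for (float p : probs.values()) {
            total += p;
        }
        if (total <= 0.0) {
            return 0.0;
        }
        double entropy = 0.0;
        for (float p : probs.values()) {
            if (p > 0f) {
                double q = p / total; // renormalize, since top-N values need not sum to 1
                entropy -= q * Math.log(q) / Math.log(2.0);
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        Map<String, Float> probs = new LinkedHashMap<>();
        probs.put(" Paris", 0.5f);
        probs.put(" Lyon", 0.5f);
        System.out.println(entropyBits(probs));
    }
}
```

Values near zero indicate that one token dominates; larger values indicate that the model spread probability across several candidates.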

--------------------------------

### Configure Inference Parameters in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to set text generation parameters with the `InferenceParameters` class. This includes token limits, sampling strategies (temperature, top-k, top-p), repetition penalties, MiroStat sampling, stop conditions, and token biases. It uses the `de.kherud.llama` and `de.kherud.llama.args` packages.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.*;
import java.util.*;

// Basic inference parameters
InferenceParameters params = new InferenceParameters("Write a haiku about Java programming.")
        // Token generation limits
        .setNPredict(100)           // Max tokens to generate (-1 = infinite)
        .setSeed(42)                // RNG seed for reproducibility

        // Temperature and sampling
        .setTemperature(0.7f)       // Creativity (0.0 = deterministic, 1.0+ = creative)
        .setTopK(40)                // Top-K sampling (0 = disabled)
        .setTopP(0.9f)              // Top-P/nucleus sampling (1.0 = disabled)
        .setMinP(0.05f)             // Min-P sampling
        .setTypicalP(1.0f)          // Locally typical sampling

        // Repetition control
        .setRepeatLastN(64)         // Tokens to consider for penalties
        .setRepeatPenalty(1.1f)     // Repetition penalty (1.0 = disabled)
        .setFrequencyPenalty(0.0f)  // Frequency penalty
        .setPresencePenalty(0.0f)   // Presence penalty
        .setPenalizeNl(true)        // Penalize newlines

        // MiroStat sampling
        .setMiroStat(MiroStat.V2)   // DISABLED, V1, or V2
        .setMiroStatTau(5.0f)       // Target entropy
        .setMiroStatEta(0.1f)       // Learning rate

        // Stop conditions
        .setStopStrings("User:", "\n\n", "###")

        // Caching
        .setCachePrompt(true);      // Cache prompt for reuse

// Sampler order customization
InferenceParameters customSamplers = new InferenceParameters("Hello")
        .setSamplers(Sampler.TOP_K, Sampler.TOP_P, Sampler.TEMPERATURE);

// Token bias - increase/decrease likelihood of specific tokens
Map<Integer, Float> tokenBias = new HashMap<>();
tokenBias.put(15043, 1.5f);   // Increase likelihood of token 15043
tokenBias.put(2, -1.0f);      // Decrease likelihood of token 2
InferenceParameters biasedParams = new InferenceParameters("Hello")
        .setTokenIdBias(tokenBias);

// Disable specific tokens
InferenceParameters noTokens = new InferenceParameters("Hello")
        .disableTokenIds(Arrays.asList(1, 2, 3));

```
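
To make the effect of temperature and top-k more concrete, the following toy sketch applies both to a small set of logits. It is an illustration under simplifying assumptions, not llama.cpp's sampler implementation, and `SamplingSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Toy illustration only (hypothetical names); llama.cpp's actual samplers are more involved. */
final class SamplingSketch {

    /** Softmax over logits divided by the temperature; temperature must be greater than 0. */
    static double[] softmax(double[] logits, double temperature) {
        double max = Arrays.stream(logits).max().orElse(0.0);
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            // Subtracting the maximum keeps exp() numerically stable
            probs[i] = Math.exp((logits[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    /** Keeps the k largest probabilities, zeroes the rest, and renormalizes; k <= 0 disables the filter. */
    static double[] topK(double[] probs, int k) {
        if (k <= 0 || k >= probs.length) {
            return probs.clone();
        }
        double[] sorted = probs.clone();
        Arrays.sort(sorted);
        double threshold = sorted[sorted.length - k]; // k-th largest value
        double[] kept = new double[probs.length];
        double sum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] >= threshold) { // ties at the threshold are all kept
                kept[i] = probs[i];
                sum += kept[i];
            }
        }
        for (int i = 0; i < kept.length && sum > 0.0; i++) {
            kept[i] /= sum;
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, 0.0};
        System.out.println(Arrays.toString(softmax(logits, 1.0)));
        System.out.println(Arrays.toString(softmax(logits, 0.5)));
        System.out.println(Arrays.toString(topK(softmax(logits, 1.0), 2)));
    }
}
```

A lower temperature concentrates probability on the highest logit, while top-k removes low-probability candidates before a token is drawn.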

--------------------------------

### Llama Model Inference and Embedding in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Shows basic inference tasks with LlamaModel. It demonstrates how to load a model, generate a response to a prompt, complete a response in one go, and generate an embedding for a given text. Emphasizes that LlamaModel is stateless and context must be managed by appending outputs to prompts. Uses try-with-resources for proper resource management.

```java
ModelParameters modelParams = new ModelParameters().setModel("/path/to/model.gguf");
InferenceParameters inferParams = new InferenceParameters("Tell me a joke.");
try (LlamaModel model = new LlamaModel(modelParams)) {
    // Stream a response and access more information about each output.
    for (LlamaOutput output : model.generate(inferParams)) {
        System.out.print(output);
    }
    // Calculate a whole response before returning it.
    String response = model.complete(inferParams);
    // Returns the hidden representation of the context + prompt.
    // Requires embedding mode, i.e. ModelParameters#enableEmbedding().
    float[] embedding = model.embed("Embed this");
}
```

--------------------------------

### Java ModelParameters Configuration

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates the extensive configuration options available through the ModelParameters class for loading and initializing Llama models. This includes setting model sources, GPU offloading, threading, memory management, LoRA adapters, and sampling parameters.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.*;

ModelParameters params = new ModelParameters()
        // Model source (choose one)
        .setModel("/path/to/model.gguf")
        // Or download from URL (requires -DLLAMA_CURL=ON during build)
        // .setModelUrl("https://huggingface.co/TheBloke/Mistral-7B-GGUF/resolve/main/mistral-7b.Q4_K_M.gguf")
        // Or from Hugging Face
        // .setHfRepo("TheBloke/Mistral-7B-GGUF")
        // .setHfFile("mistral-7b.Q4_K_M.gguf")

        // GPU and performance
        .setGpuLayers(35)           // Layers to offload to GPU (0 = CPU only)
        .setThreads(8)              // CPU threads for generation
        .setThreadsBatch(8)         // CPU threads for batch processing
        .setBatchSize(512)          // Logical batch size
        .setCtxSize(4096)           // Context window size

        // Memory and caching
        .enableFlashAttn()          // Enable Flash Attention
        .enableMlock()              // Lock model in RAM
        .setCacheTypeK(CacheType.F16)
        .setCacheTypeV(CacheType.F16)

        // LoRA adapters
        .addLoraAdapter("/path/to/lora-adapter.bin")
        .addLoraScaledAdapter("/path/to/another-lora.bin", 0.5f)

        // Sampling defaults
        .setTemp(0.8f)
        .setTopK(40)
        .setTopP(0.95f)
        .setRepeatPenalty(1.1f)

        // Special modes
        .enableEmbedding()          // Enable embedding generation
        .enableReranking()          // Enable document reranking

        // Logging
        .enableLogTimestamps()
        .enableLogPrefix();
```

--------------------------------

### Basic Java LlamaModel Usage

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates the core usage of the LlamaModel class for loading a model and performing both streaming and non-streaming text generation. It highlights the use of try-with-resources for proper memory management of native resources.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;

public class BasicUsage {
    public static void main(String[] args) {
        // Configure and load the model
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43)    // Number of layers to offload to GPU
                .setCtxSize(2048);   // Context window size

        try (LlamaModel model = new LlamaModel(modelParams)) {
            // Streaming generation
            InferenceParameters inferParams = new InferenceParameters("Tell me a joke about programming.")
                    .setTemperature(0.7f)
                    .setTopP(0.9f)
                    .setNPredict(100);

            System.out.print("Response: ");
            for (LlamaOutput output : model.generate(inferParams)) {
                System.out.print(output);
            }
            System.out.println();

            // Non-streaming completion
            String response = model.complete(inferParams);
            System.out.println("Complete response: " + response);
        }
    }
}
```

--------------------------------

### Compile llama.cpp Java Bindings

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

These shell commands demonstrate the process of compiling the llama.cpp Java bindings. It includes running Maven to compile the Java code, followed by CMake commands to configure and build the native libraries, with an option to enable CUDA support.

```shell
mvn compile
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

--------------------------------

### Perform Code Infilling in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet demonstrates code infilling, where the model generates code to fill the gap between a given prefix and suffix. This is useful for code completion and insertion tasks. It utilizes `setInputPrefix` and `setInputSuffix` within `InferenceParameters`. Dependencies include the de.kherud.llama library and a suitable code model like codellama.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/codellama-7b.Q2_K.gguf")
        .setGpuLayers(43);

try (LlamaModel model = new LlamaModel(modelParams)) {
    String prefix = "def remove_non_ascii(s: str) -> str:\n    \"\"\" ";
    // The suffix is the code that follows the gap to be filled
    String suffix = "\n    return result\n";

    InferenceParameters params = new InferenceParameters("")
            .setInputPrefix(prefix)
            .setInputSuffix(suffix)
            .setTemperature(0.2f)
            .setNPredict(100);

    System.out.print(prefix);
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
    System.out.print(suffix);

    // Illustrative output (the actual text depends on the model):
    // def remove_non_ascii(s: str) -> str:
    //     """ Remove non-ASCII characters from a string.
    // 
    //     Args:
    //         s: Input string
    // 
    //     Returns:
    //         String with only ASCII characters
    //     """
    //     result = ''.join(char for char in s if ord(char) < 128)
    //     return result
}

```

--------------------------------

### Format Chat Messages with Jinja Templates in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to use Jinja templating with the Llama.cpp Java library to format chat messages, including system prompts and conversation history. This ensures proper structure for the model's input. It requires enabling Jinja templating in ModelParameters and setting messages using `InferenceParameters.setMessages()`.

```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .enableJinja();  // Enable Jinja templating

try (LlamaModel model = new LlamaModel(modelParams)) {
    // Build conversation with system message and history
    String systemMessage = "You are a helpful coding assistant.";

    List<Pair<String, String>> messages = new ArrayList<>();
    messages.add(new Pair<>("user", "What is a Python list comprehension?"));
    messages.add(new Pair<>("assistant", "A list comprehension is a concise way to create lists in Python."));
    messages.add(new Pair<>("user", "Show me an example."));

    InferenceParameters params = new InferenceParameters("")
            .setMessages(systemMessage, messages)
            .setTemperature(0.7f)
            .setNPredict(200);

    // Apply chat template to see formatted prompt
    String formattedPrompt = model.applyTemplate(params);
    System.out.println("Formatted prompt:\n" + formattedPrompt);
    // Example output for a model with a ChatML-style template (the markers depend on the model's chat template):
    // <|im_start|>system
    // You are a helpful coding assistant.<|im_end|>
    // <|im_start|>user
    // What is a Python list comprehension?<|im_end|>
    // ...

    // Generate response
    System.out.println("\nResponse:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
}
```
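
For intuition about what a chat template produces, the following toy sketch builds a ChatML-style prompt by hand, matching the `<|im_start|>` markers in the commented output above. It is not how java-llama.cpp applies templates, since the real formatting comes from the model's own Jinja template. `ChatMlSketch` and `Message` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Toy ChatML formatter (hypothetical; the library applies the model's own template instead). */
final class ChatMlSketch {

    /** Minimal role/content pair used only by this sketch. */
    static final class Message {
        final String role;
        final String content;

        Message(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    /** Formats a system message and the history, then opens the assistant turn. */
    static String format(String system, List<Message> messages) {
        StringBuilder sb = new StringBuilder();
        append(sb, "system", system);
        for (Message message : messages) {
            append(sb, message.role, message.content);
        }
        sb.append("<|im_start|>assistant\n"); // generation continues from here
        return sb.toString();
    }

    private static void append(StringBuilder sb, String role, String content) {
        sb.append("<|im_start|>").append(role).append('\n')
          .append(content).append("<|im_end|>\n");
    }

    public static void main(String[] args) {
        List<Message> history = Arrays.asList(
                new Message("user", "What is a Python list comprehension?"),
                new Message("assistant", "A concise way to create lists in Python."),
                new Message("user", "Show me an example."));
        System.out.print(format("You are a helpful coding assistant.", history));
    }
}
```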

--------------------------------

### Fetch json Dependency (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Fetches the nlohmann/json library from GitHub using FetchContent. This is a dependency for JSON handling within the project. It specifies the Git repository and a specific tag for version control.

```cmake
FetchContent_Declare(
	json
	GIT_REPOSITORY https://github.com/nlohmann/json
	GIT_TAG        v3.11.3
)
FetchContent_MakeAvailable(json)
```

--------------------------------

### Non-Streaming Text Completion with complete() in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Shows how to use the `complete()` method for non-streaming text generation, where the entire response is returned at once. This is useful when the full output is needed before proceeding. Requires the `de.kherud.llama.*` library.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .setGpuLayers(43);

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters(
            "Translate to French: Hello, how are you today?")
            .setTemperature(0.3f)
            .setNPredict(50)
            .setSeed(42);

    String response = model.complete(params);
    System.out.println("Translation: " + response);
    // Example output (depends on the model): Translation: Bonjour, comment allez-vous aujourd'hui?
}
```

--------------------------------

### Generate Text Embeddings in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates how to generate vector embeddings for text using embedding-capable models. This requires enabling embedding mode during model initialization and uses the `embed()` method. Requires the `de.kherud.llama.*` library.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/nomic-embed-text-v1.5.Q4_K_M.gguf")
        .enableEmbedding();  // Required for embedding generation

try (LlamaModel model = new LlamaModel(modelParams)) {
    String text = "Machine learning is a subset of artificial intelligence.";

    float[] embedding = model.embed(text);

    System.out.println("Embedding dimensions: " + embedding.length);
    System.out.println("First 5 values: ");
    for (int i = 0; i < Math.min(5, embedding.length); i++) {
        System.out.printf("  [%d]: %.6f%n", i, embedding[i]);
    }

    // Calculate cosine similarity between two embeddings
    float[] embedding2 = model.embed("AI includes machine learning and deep learning.");
    double similarity = cosineSimilarity(embedding, embedding2);
    System.out.printf("Cosine similarity: %.4f%n", similarity);
}

static double cosineSimilarity(float[] a, float[] b) {
    double dotProduct = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
```
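
Cosine similarity is typically used to find the closest match among several embeddings. The following sketch extends the idea to a small nearest-neighbor lookup over plain `float[]` vectors, such as those returned by `embed()`. `EmbeddingSearch` is a hypothetical helper. Unlike the helper above, it rejects mismatched dimensions and returns 0 for zero-length vectors instead of dividing by zero.

```java
/** Hypothetical helper (not part of java-llama.cpp): nearest neighbor by cosine similarity. */
final class EmbeddingSearch {

    /** Cosine similarity; returns 0.0 if either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the candidate most similar to the query, or -1 if there are no candidates. */
    static int mostSimilar(float[] query, float[][] candidates) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < candidates.length; i++) {
            double score = cosine(query, candidates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[][] candidates = {{0f, 1f}, {2f, 0.1f}, {-1f, 0f}};
        System.out.println(mostSimilar(query, candidates));
    }
}
```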

--------------------------------

### Tokenize Text with encode() and decode() in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates direct access to the model's tokenizer for converting text to token IDs (`encode()`) and token IDs back to text (`decode()`). This is useful for pre-generation token counting. Requires the `de.kherud.llama.*` library.

```java
import de.kherud.llama.*;
import java.util.Arrays;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    String text = "Hello, world! How are you?";

    // Encode text to token IDs
    int[] tokens = model.encode(text);
    System.out.println("Original text: " + text);
    System.out.println("Token count: " + tokens.length);
    System.out.println("Token IDs: " + Arrays.toString(tokens));

    // Decode token IDs back to text
    String decoded = model.decode(tokens);
    System.out.println("Decoded text: " + decoded);
    // Note: Llama tokenizer adds a space prefix, so decoded may have leading space

    // Useful for counting tokens before generation
    String longPrompt = "This is a very long prompt...";
    int tokenCount = model.encode(longPrompt).length;
    System.out.println("Prompt uses " + tokenCount + " tokens");
}
```
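
Token counts from `encode()` can be compared against the context size before generation starts. The following sketch shows the arithmetic, assuming that the prompt and the generated tokens share one context window of `ctxSize` tokens. `TokenBudget` is a hypothetical helper, and the numbers in `main` are made up.

```java
/** Hypothetical helper (not part of java-llama.cpp): checks a prompt against a context budget. */
final class TokenBudget {

    /** True if the prompt plus the requested completion fits; a negative nPredict (unbounded) never fits. */
    static boolean fits(int promptTokens, int nPredict, int ctxSize) {
        if (nPredict < 0) {
            return false;
        }
        return (long) promptTokens + nPredict <= ctxSize;
    }

    /** Largest completion length that still fits, or 0 if the prompt alone fills the window. */
    static int maxPredict(int promptTokens, int ctxSize) {
        return Math.max(0, ctxSize - promptTokens);
    }

    public static void main(String[] args) {
        int promptTokens = 1900; // e.g. model.encode(prompt).length
        int ctxSize = 2048;      // e.g. the value passed to setCtxSize
        System.out.println(fits(promptTokens, 200, ctxSize));
        System.out.println(maxPredict(promptTokens, ctxSize));
    }
}
```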

--------------------------------

### Rerank Documents by Relevance in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Shows how to use the Llama.cpp Java library for document reranking. It allows you to score and sort documents based on their relevance to a given query. This feature requires enabling reranking in ModelParameters. The output provides raw relevance scores and a sorted list of documents.

```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/jina-reranker-v1-tiny-en-Q4_0.gguf")
        .setCtxSize(512)
        .enableReranking();  // Required for reranking

try (LlamaModel model = new LlamaModel(modelParams)) {
    String query = "Machine learning applications";

    String[] documents = {
        "A machine is a physical system that uses power to perform actions.",
        "Learning is the process of acquiring new knowledge and skills.",
        "Machine learning is a field of AI that enables computers to learn from data.",
        "Paris is the capital of France and a major European city."
    };

    // Get raw reranking scores
    LlamaOutput output = model.rerank(query, documents);
    System.out.println("Relevance scores:");
    for (Map.Entry<String, Float> entry : output.probabilities.entrySet()) {
        System.out.printf("  %.4f: %s...%n",
                entry.getValue(),
                entry.getKey().substring(0, Math.min(50, entry.getKey().length())));
    }

    // Get sorted results (most relevant first)
    List<Pair<String, Float>> rankedDocs = model.rerank(true, query, documents);
    System.out.println("\nRanked documents (best first):");
    for (int i = 0; i < rankedDocs.size(); i++) {
        Pair<String, Float> doc = rankedDocs.get(i);
        System.out.printf("%d. [%.4f] %s...%n",
                i + 1,
                doc.getValue(),
                doc.getKey().substring(0, Math.min(50, doc.getKey().length())));
    }
    // Illustrative output (scores are made up; actual values depend on the model):
    // 1. [0.9823] Machine learning is a field of AI that enables com...
    // 2. [0.3421] Learning is the process of acquiring new knowledge...
    // 3. [0.1234] A machine is a physical system that uses power to ...
    // 4. [0.0012] Paris is the capital of France and a major Europea...
}
```
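
The raw scores can also be sorted manually, for example to apply a custom cutoff. The following sketch orders a `Map<String, Float>` of document scores, like the one read from `output.probabilities` above, from most to least relevant. `ScoreSort` is a hypothetical helper, and the scores in `main` are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of java-llama.cpp): sorts document scores in descending order. */
final class ScoreSort {

    /** Returns the entries ordered from highest to lowest score; the input map is not modified. */
    static List<Map.Entry<String, Float>> byScoreDescending(Map<String, Float> scores) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("doc about machines", 0.1f);
        scores.put("doc about machine learning", 0.9f);
        scores.put("doc about learning", 0.5f);
        for (Map.Entry<String, Float> entry : byScoreDescending(scores)) {
            System.out.printf("%.2f  %s%n", entry.getValue(), entry.getKey());
        }
    }
}
```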

--------------------------------

### Fetch llama.cpp Dependency (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Fetches the llama.cpp library from its GitHub repository using FetchContent. This is the core C++ library for llama model inference. The dependency is pinned to the release tag `b4916` via `GIT_TAG`.

```cmake
FetchContent_Declare(
	llama.cpp
	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
	GIT_TAG        b4916
)
FetchContent_MakeAvailable(llama.cpp)
```

--------------------------------

### Build jllama Shared Library (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Defines and builds the `jllama` shared library from the main C++ source file and its headers. It enables position-independent code, requires C++11, links against the `common`, `llama`, and `nlohmann_json` targets, and defines `SERVER_VERBOSE` from the `LLAMA_VERBOSE` option.

```cmake
add_library(jllama SHARED src/main/cpp/jllama.cpp src/main/cpp/server.hpp src/main/cpp/utils.hpp)

set_target_properties(jllama PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_include_directories(jllama PRIVATE src/main/cpp ${JNI_INCLUDE_DIRS})
target_link_libraries(jllama PRIVATE common llama nlohmann_json)
target_compile_features(jllama PRIVATE cxx_std_11)

target_compile_definitions(jllama PRIVATE
    SERVER_VERBOSE=$<BOOL:${LLAMA_VERBOSE}>
)
```

--------------------------------

### Set Runtime Output Directory (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Configures the output directory for the built 'jllama' library based on the operating system. For Windows, it sets specific debug, release, and relwithdebinfo directories. For other OS, it sets a general library output directory.

```cmake
if(OS_NAME STREQUAL "Windows")
    set_target_properties(jllama llama ggml PROPERTIES
        RUNTIME_OUTPUT_DIRECTORY_DEBUG ${JLLAMA_DIR}
        RUNTIME_OUTPUT_DIRECTORY_RELEASE ${JLLAMA_DIR}
        RUNTIME_OUTPUT_DIRECTORY_RELWITHDEBINFO ${JLLAMA_DIR}
    )
else()
    set_target_properties(jllama llama ggml PROPERTIES
        LIBRARY_OUTPUT_DIRECTORY ${JLLAMA_DIR}
    )
endif()
```

--------------------------------

### Constrain Model Output with BNF-like Grammar in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet constrains model output with a BNF-like grammar so that the result follows a fixed format such as arithmetic expressions, yes/no answers, or custom character patterns. The grammar is passed as a string to `InferenceParameters.setGrammar`, and the text is then produced with `LlamaModel.generate` (streaming) or `LlamaModel.complete` (single string). Requires the `de.kherud.llama` package.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    // Grammar for simple arithmetic expressions
    String arithmeticGrammar = """
            root  ::= (expr "=" term "\n")+ 
            expr  ::= term ([-+*/] term)*
            term  ::= [0-9]
            """;

    InferenceParameters params = new InferenceParameters(
            "Generate some arithmetic expressions:")
            .setGrammar(arithmeticGrammar)
            .setNPredict(50);

    System.out.println("Arithmetic expressions:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
    // Output example:
    // 3+4=7
    // 9-2=7
    // 5*1=5

    // Grammar for yes/no answers only
    String yesNoGrammar = "root ::= (\"yes\" | \"no\")";
    InferenceParameters yesNoParams = new InferenceParameters(
            "Is the sky blue? Answer: ")
            .setGrammar(yesNoGrammar)
            .setNPredict(5);

    String answer = model.complete(yesNoParams);
    System.out.println("\nYes/No answer: " + answer);

    // Grammar for character sequences
    String abGrammar = "root ::= (\"a\" | \"b\")+";
    InferenceParameters abParams = new InferenceParameters("")
            .setGrammar(abGrammar)
            .setNPredict(20);

    String abSequence = model.complete(abParams);
    System.out.println("AB sequence: " + abSequence);  // e.g., "aababbaab"
}
```
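
The yes/no grammar above is a single alternation rule, so grammars of that shape can be assembled from a list of allowed answers. The following is a minimal sketch of such a helper. `GrammarSketch` and `choiceGrammar` are hypothetical names, not part of the library, and the sketch assumes that escaping backslashes and double quotes is enough for the options being quoted. It only builds the grammar string; passing the result to `setGrammar` works as in the example above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: builds a one-rule grammar that allows exactly the given options.
// GrammarSketch and choiceGrammar are hypothetical names.
final class GrammarSketch {

    static String choiceGrammar(List<String> options) {
        if (options.isEmpty()) {
            throw new IllegalArgumentException("at least one option is required");
        }
        return options.stream()
                // Quote each option, escaping backslashes and double quotes.
                .map(o -> "\"" + o.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" | ", "root ::= (", ")"));
    }

    public static void main(String[] args) {
        // Prints: root ::= ("yes" | "no")
        System.out.println(choiceGrammar(List.of("yes", "no")));
    }
}
```

Generating the rule from a list keeps the allowed answers in one place when the same set is also used to validate the model's reply.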

--------------------------------

### Determine OS Architecture (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

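Determines the CPU architecture by running the compiled `de.kherud.llama.OSInfo` class with `--arch`, unless `OS_ARCH` is already defined. Because the class is loaded from `target/classes`, `mvn compile` must have been run beforehand. Configuration stops with a fatal error if no architecture could be determined.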

```cmake
if(NOT DEFINED OS_ARCH)
    find_package(Java REQUIRED)
    find_program(JAVA_EXECUTABLE NAMES java)
    execute_process(
      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes de.kherud.llama.OSInfo --arch
      OUTPUT_VARIABLE OS_ARCH
      OUTPUT_STRIP_TRAILING_WHITESPACE
    )
endif()
if(NOT OS_ARCH)
    message(FATAL_ERROR "Could not determine CPU architecture")
endif()
```

--------------------------------

### Copy Metal Shader (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Conditionally copies the 'ggml-metal.metal' shader file to the output directory when building with Metal support and not embedding the library. This ensures the Metal shader is available at runtime for macOS GPU acceleration.

```cmake
if (LLAMA_METAL AND NOT LLAMA_METAL_EMBED_LIBRARY)
    configure_file(${llama.cpp_SOURCE_DIR}/ggml-metal.metal ${JLLAMA_DIR}/ggml-metal.metal COPYONLY)
endif()
```

--------------------------------

### Configure Android Gradle Build for java-llama.cpp

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This Gradle (Kotlin DSL) snippet integrates java-llama.cpp into an Android project. It runs `mvn compile` in the library directory if `target/` does not exist yet, points the external native build at the library's `CMakeLists.txt`, and adds the library's Java sources to the `main` source set.

```gradle
android {
    val jllamaLib = file("java-llama.cpp")

    // Execute "mvn compile" if folder target/ doesn't exist at ./java-llama.cpp/
    if (!file("$jllamaLib/target").exists()) {
        exec {
            commandLine = listOf("mvn", "compile")
            workingDir = file("java-llama.cpp/")
        }
    }

    ...
    defaultConfig {
        ...
        externalNativeBuild {
            cmake {
                // Add any flags if needed
                cppFlags += ""
                arguments += ""
            }
        }
    }

    // Declare c++ sources
    externalNativeBuild {
        cmake {
            path = file("$jllamaLib/CMakeLists.txt")
            version = "3.22.1"
        }
    }

    // Declare java sources
    sourceSets {
        named("main") {
            // Add source directory for java-llama.cpp
            java.srcDir("$jllamaLib/src/main/java")
        }
    }
}
```

--------------------------------

### Cancel Llama Text Generation in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to cancel text generation early with `LlamaIterator.cancel()`. The example stops after 50 tokens or as soon as the accumulated output contains a marker string. Requires the `de.kherud.llama` package.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters("Write a very long story...")
            .setNPredict(1000);

    LlamaIterator iterator = model.generate(params).iterator();
    int tokenCount = 0;
    StringBuilder output = new StringBuilder();

    while (iterator.hasNext()) {
        LlamaOutput token = iterator.next();
        output.append(token);
        tokenCount++;

        // Cancel after 50 tokens or if we see a specific pattern
        if (tokenCount >= 50 || output.toString().contains("THE END")) {
            iterator.cancel();
            System.out.println("Generation cancelled after " + tokenCount + " tokens");
        }
    }

    System.out.println("Output: " + output);
}
```
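
The loop above only terminates early if the iterator reports no further elements once `cancel()` has been called. That pattern can be sketched without a model by using a stand-in iterator. `CancellableIterator` below is hypothetical and is not the library's `LlamaIterator`; it simply assumes that cancelling makes `hasNext()` return `false`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustration only: a stand-in for a cancellable token stream.
final class CancelSketch {

    // Hypothetical wrapper: stops yielding elements after cancel() is called.
    static final class CancellableIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        private boolean cancelled;

        CancellableIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        void cancel() {
            cancelled = true;
        }

        @Override
        public boolean hasNext() {
            return !cancelled && delegate.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return delegate.next();
        }
    }

    // Consumes tokens until the limit is reached or the stop text appears.
    static String collect(Iterator<String> tokens, int maxTokens, String stopText) {
        CancellableIterator<String> it = new CancellableIterator<>(tokens);
        StringBuilder out = new StringBuilder();
        int count = 0;
        while (it.hasNext()) {
            out.append(it.next());
            count++;
            if (count >= maxTokens || out.indexOf(stopText) >= 0) {
                it.cancel(); // the next hasNext() call returns false
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Once ", "upon ", "a ", "time. ", "THE END", " (extra)");
        // Prints: Once upon a time. THE END
        System.out.println(collect(tokens.iterator(), 50, "THE END"));
    }
}
```

Cancelling through the iterator rather than breaking out of the loop lets the producer know that no more output is wanted, which matters when generation is expensive.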

--------------------------------

### Determine OS Name (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

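Determines the operating system name by running the compiled `de.kherud.llama.OSInfo` class with `--os`, unless `OS_NAME` is already defined. The result drives platform-specific build configuration such as the output directory. Because the class is loaded from `target/classes`, `mvn compile` must have been run beforehand, and configuration stops with a fatal error if no name could be determined.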

```cmake
if(NOT DEFINED OS_NAME)
    find_package(Java REQUIRED)
    find_program(JAVA_EXECUTABLE NAMES java)
    execute_process(
      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes de.kherud.llama.OSInfo --os
      OUTPUT_VARIABLE OS_NAME
      OUTPUT_STRIP_TRAILING_WHITESPACE
    )
endif()
if(NOT OS_NAME)
    message(FATAL_ERROR "Could not determine OS name")
endif()
```
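
Both detection steps delegate to a Java class because the JVM already exposes the platform through the `os.name` and `os.arch` system properties. The sketch below shows how such a lookup might normalize those properties into stable labels. `PlatformSketch` and the labels it returns are hypothetical and are not the actual output of `de.kherud.llama.OSInfo`.

```java
import java.util.Locale;

// Illustration only: normalizes JVM platform properties into short labels.
// The class name and the returned labels are hypothetical.
final class PlatformSketch {

    static String normalizeOs(String osName) {
        String s = osName.toLowerCase(Locale.ROOT);
        // Check macOS first: "darwin" contains the substring "win".
        if (s.contains("mac") || s.contains("darwin")) return "Mac";
        if (s.contains("win")) return "Windows";
        if (s.contains("linux")) return "Linux";
        return osName.replaceAll("\\s+", "");
    }

    static String normalizeArch(String osArch) {
        String s = osArch.toLowerCase(Locale.ROOT);
        if (s.equals("amd64") || s.equals("x86_64")) return "x86_64";
        if (s.equals("arm64") || s.equals("aarch64")) return "aarch64";
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalizeOs(System.getProperty("os.name")));
        System.out.println(normalizeArch(System.getProperty("os.arch")));
    }
}
```

Normalizing in one place means the build script only has to compare against a small, fixed set of strings, as the `OS_NAME STREQUAL "Windows"` check in the CMake configuration does.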

--------------------------------

### Add java-llama.cpp as Git Submodule

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This command adds the java-llama.cpp repository as a Git submodule. Run it from your Android project's app directory so that the library is checked out there. A submodule lets you track the library's code separately while keeping it inside your project.

```shell
git submodule add https://github.com/kherud/java-llama.cpp
```

--------------------------------

### Add llama.cpp Java Dependency via Maven

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This XML snippet shows how to add the llama.cpp Java bindings as a dependency to a Maven project. It specifies the group ID, artifact ID, and version required to include the library in your project's build path.

```xml
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.1.0</version>
</dependency>
```

--------------------------------

### Exclude java-llama.cpp from ProGuard

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This ProGuard rule ensures that the `de.kherud.llama` package and its contents are not stripped or obfuscated during the release build process. This is crucial for the library to function correctly after code shrinking.

```proguard
-keep class de.kherud.llama.** { *; }
```
