### Basic Llama Model Inference Example in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Demonstrates a complete conversational example using LlamaModel. It sets up model parameters, including GPU layers, then loops: it reads user input, streams Llama's response, and appends both to the prompt to maintain context. Uses try-with-resources for automatic model cleanup.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Example {

    public static void main(String... args) throws IOException {
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43);

        String system = "This is a conversation between User and Llama, a friendly chatbot.\n" +
                "Llama is helpful, kind, honest, good at writing, and never fails to answer any " +
                "requests immediately and with precision.\n";
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try (LlamaModel model = new LlamaModel(modelParams)) {
            System.out.print(system);
            String prompt = system;
            while (true) {
                // Append the user turn to the running prompt
                prompt += "\nUser: ";
                System.out.print("\nUser: ");
                String input = reader.readLine();
                prompt += input;
                System.out.print("Llama: ");
                prompt += "\nLlama: ";
                InferenceParameters inferParams = new InferenceParameters(prompt)
                        .setTemperature(0.7f)
                        .setPenalizeNl(true)
                        .setMiroStat(MiroStat.V2)
                        .setStopStrings("User:");
                // Stream the response and append it so the next turn sees it
                for (LlamaOutput output : model.generate(inferParams)) {
                    System.out.print(output);
                    prompt += output;
                }
            }
        }
    }
}
```

--------------------------------

### Configuring Llama Model and Inference Parameters in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Illustrates how to configure `ModelParameters` and `InferenceParameters` using their chained setters. This includes specifying the model path, adding LoRA adapters, setting grammar rules for controlled generation, and adjusting the temperature for creativity.
The example uses try-with-resources for safe model loading and unloading.

```java
ModelParameters modelParams = new ModelParameters()
        .setModel("/path/to/model.gguf")
        .addLoraAdapter("/path/to/lora/adapter");
// "\\n" keeps a literal backslash-n in the grammar; a plain \n inside a
// text block would be turned into a real line break by the Java compiler.
String grammar = """
        root  ::= (expr "=" term "\\n")+
        expr  ::= term ([-+*/] term)*
        term  ::= [0-9]""";
InferenceParameters inferParams = new InferenceParameters("")
        .setGrammar(grammar)
        .setTemperature(0.8f);
try (LlamaModel model = new LlamaModel(modelParams)) {
    model.generate(inferParams);
}
```

--------------------------------

### Stream Text Generation in Java using generate()

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates how to perform token-by-token text generation using the `generate()` method, which returns an `LlamaIterable`. This allows for real-time output display and mid-stream cancellation. The example sets up a basic chat loop, processes user input, and streams the model's response. It requires the `de.kherud.llama` and `de.kherud.llama.args` packages, along with standard Java IO.

```java
import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamingChat {
    public static void main(String[] args) throws IOException {
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43);

        String systemPrompt = "This is a conversation between User and Assistant.\n" +
                "Assistant is helpful, kind, and honest.\n\n" +
                "User: Hello!\nAssistant: Hello! How can I help you today?";

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));

        try (LlamaModel model = new LlamaModel(modelParams)) {
            String context = systemPrompt;

            while (true) {
                System.out.print("\nUser: ");
                String userInput = reader.readLine();
                if (userInput == null || userInput.equalsIgnoreCase("quit")) break;

                context += "\nUser: " + userInput + "\nAssistant: ";

                InferenceParameters inferParams = new InferenceParameters(context)
                        .setTemperature(0.7f)
                        .setPenalizeNl(true)
                        .setMiroStat(MiroStat.V2)
                        .setStopStrings("User:", "\n\n");

                System.out.print("Assistant: ");
                StringBuilder response = new StringBuilder();
                for (LlamaOutput output : model.generate(inferParams)) {
                    System.out.print(output);
                    response.append(output);
                }
                System.out.println();
                context += response.toString();
            }
        }
    }
}
```

--------------------------------

### Maven Dependency Configuration for java-llama.cpp

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Provides Maven dependency configurations for including the java-llama.cpp library in a Java project. It shows the basic CPU-only dependency and an example for enabling CUDA GPU support on Linux x86-64 systems.

```xml
<!-- Basic CPU-only dependency -->
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.2.0</version>
</dependency>

<!-- CUDA GPU support on Linux x86-64 -->
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.2.0</version>
    <classifier>cuda12-linux-x86-64</classifier>
</dependency>
```

--------------------------------

### Configure JNI Include Directories (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Sets the include directories for JNI headers based on the detected OS. It handles Unix-like systems (Linux, macOS) and Windows separately, using headers bundled in the repository. For any other OS, it attempts to locate the headers via the Java installation, and the build fails if none are found.
```cmake
if(NOT DEFINED JNI_INCLUDE_DIRS)
    if(OS_NAME MATCHES "^Linux" OR OS_NAME STREQUAL "Mac" OR OS_NAME STREQUAL "Darwin")
        set(JNI_INCLUDE_DIRS .github/include/unix)
    elseif(OS_NAME STREQUAL "Windows")
        set(JNI_INCLUDE_DIRS .github/include/windows)
    else()
        find_package(Java REQUIRED)
        find_program(JAVA_EXECUTABLE NAMES java)

        find_path(JNI_INCLUDE_DIRS NAMES jni.h HINTS ENV JAVA_HOME PATH_SUFFIXES include)

        file(GLOB_RECURSE JNI_MD_PATHS RELATIVE "${JNI_INCLUDE_DIRS}" "${JNI_INCLUDE_DIRS}/**/jni_md.h")
        foreach(PATH IN LISTS JNI_MD_PATHS)
            get_filename_component(DIR ${PATH} DIRECTORY)
            list(APPEND JNI_INCLUDE_DIRS "${JNI_INCLUDE_DIRS}/${DIR}")
        endforeach()
    endif()
endif()

if(NOT JNI_INCLUDE_DIRS)
    message(FATAL_ERROR "Could not determine JNI include directories")
endif()
```

--------------------------------

### Configure Logging in Java with Llama.cpp

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates how to configure logging for the Llama.cpp Java library. You can redirect logs to custom handlers, specify log formats (TEXT or JSON), and control verbosity. This is crucial for debugging and monitoring model behavior. The examples show setting loggers with different formats and disabling logging.
```java
import de.kherud.llama.*;
import de.kherud.llama.args.LogFormat;

// Redirect logs to custom handler (e.g., logging framework)
LlamaModel.setLogger(LogFormat.TEXT, (level, message) -> {
    switch (level) {
        case ERROR:
            System.err.println("[ERROR] " + message);
            break;
        case WARN:
            System.err.println("[WARN] " + message);
            break;
        case INFO:
            System.out.println("[INFO] " + message);
            break;
        case DEBUG:
            System.out.println("[DEBUG] " + message);
            break;
    }
});

// Use JSON format for structured logging
LlamaModel.setLogger(LogFormat.JSON, (level, message) -> {
    // message is already JSON formatted
    System.out.println(message);
});

// Log to stdout with different format (pass null callback)
LlamaModel.setLogger(LogFormat.TEXT, null);

// Disable logging completely
LlamaModel.setLogger(null, (level, message) -> {});

// Now load and use model
ModelParameters params = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .enableLogTimestamps()
        .enableLogPrefix()
        .setLogVerbosity(1); // 0 = minimal, higher = more verbose
```

--------------------------------

### Customizing Llama Model Logging in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Shows how to intercept and customize log messages from the LlamaModel. It demonstrates redirecting log messages to a custom callback, changing the log format while still writing to stdout, and passing a no-op callback to disable logging entirely.

```java
// Re-direct log messages however you like (e.g. to a logging library)
LlamaModel.setLogger(LogFormat.TEXT, (level, message) -> System.out.println(level.name() + ": " + message));
// Log to stdout, but change the format
LlamaModel.setLogger(LogFormat.TEXT, null);
// Disable logging by passing a no-op
LlamaModel.setLogger(null, (level, message) -> {});
```

--------------------------------

### Convert JSON Schema to GBNF Grammar in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet shows how to convert a JSON schema into a GBNF grammar, enabling type-safe structured JSON output from the language model. It uses the static `jsonSchemaToGrammar` method from the `LlamaModel` class. The generated grammar can then be used with `InferenceParameters` to guide JSON generation. Dependencies include the de.kherud.llama library.

```java
import de.kherud.llama.*;

String jsonSchema = """
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "email": {"type": "string"}
            },
            "required": ["name", "age"],
            "additionalProperties": false
        }
        """;

// Convert JSON schema to grammar
String grammar = LlamaModel.jsonSchemaToGrammar(jsonSchema);
System.out.println("Generated Grammar:\n" + grammar);

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters(
            "Generate a JSON object for a person named John who is 30 years old:")
            .setGrammar(grammar)
            .setNPredict(100);

    String jsonOutput = model.complete(params);
    System.out.println("JSON output: " + jsonOutput);
    // Example output: {"name": "John", "age": 30}
}
```

--------------------------------

### Access Token Probabilities in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to access token probabilities during text generation in Java using the java-llama.cpp library. This allows for uncertainty estimation and analysis of the generated output.
It requires the llama library and a pre-trained model.

```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters("The capital of France is")
            .setNProbs(5)          // Return top 5 token probabilities
            .setTemperature(0.0f)  // Deterministic for reproducibility
            .setNPredict(10);

    System.out.println("Generation with probabilities:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output.text);

        // Access probabilities for this token
        if (!output.probabilities.isEmpty()) {
            System.out.println("\n  Top tokens:");
            output.probabilities.entrySet().stream()
                    .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
                    .limit(3)
                    .forEach(e -> System.out.printf("    '%s': %.4f%n", e.getKey(), e.getValue()));
        }
    }
}
```

--------------------------------

### Configure Inference Parameters in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to set various parameters for text generation using the InferenceParameters class. This includes controlling token limits, sampling strategies (temperature, top-k, top-p), repetition penalties, MiroStat sampling, stop conditions, and token biases. It requires the `de.kherud.llama` and `de.kherud.llama.args` packages.
```java
import de.kherud.llama.*;
import de.kherud.llama.args.*;
import java.util.*;

// Basic inference parameters
InferenceParameters params = new InferenceParameters("Write a haiku about Java programming.")
        // Token generation limits
        .setNPredict(100)           // Max tokens to generate (-1 = infinite)
        .setSeed(42)                // RNG seed for reproducibility

        // Temperature and sampling
        .setTemperature(0.7f)       // Creativity (0.0 = deterministic, 1.0+ = creative)
        .setTopK(40)                // Top-K sampling (0 = disabled)
        .setTopP(0.9f)              // Top-P/nucleus sampling (1.0 = disabled)
        .setMinP(0.05f)             // Min-P sampling
        .setTypicalP(1.0f)          // Locally typical sampling

        // Repetition control
        .setRepeatLastN(64)         // Tokens to consider for penalties
        .setRepeatPenalty(1.1f)     // Repetition penalty (1.0 = disabled)
        .setFrequencyPenalty(0.0f)  // Frequency penalty
        .setPresencePenalty(0.0f)   // Presence penalty
        .setPenalizeNl(true)        // Penalize newlines

        // MiroStat sampling
        .setMiroStat(MiroStat.V2)   // DISABLED, V1, or V2
        .setMiroStatTau(5.0f)       // Target entropy
        .setMiroStatEta(0.1f)       // Learning rate

        // Stop conditions
        .setStopStrings("User:", "\n\n", "###")

        // Caching
        .setCachePrompt(true);      // Cache prompt for reuse

// Sampler order customization
InferenceParameters customSamplers = new InferenceParameters("Hello")
        .setSamplers(Sampler.TOP_K, Sampler.TOP_P, Sampler.TEMPERATURE);

// Token bias - increase/decrease likelihood of specific tokens
Map<Integer, Float> tokenBias = new HashMap<>();
tokenBias.put(15043, 1.5f);  // Increase likelihood of token 15043
tokenBias.put(2, -1.0f);     // Decrease likelihood of token 2

InferenceParameters biasedParams = new InferenceParameters("Hello")
        .setTokenIdBias(tokenBias);

// Disable specific tokens
InferenceParameters noTokens = new InferenceParameters("Hello")
        .disableTokenIds(Arrays.asList(1, 2, 3));
```

--------------------------------

### Llama Model Inference and Embedding in Java

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Shows basic inference tasks with LlamaModel.
It demonstrates how to load a model, stream a response to a prompt, compute a whole response in one call, and generate an embedding for a given text. Emphasizes that LlamaModel is stateless and context must be managed by appending outputs to prompts. Uses try-with-resources for proper resource management.

```java
ModelParameters modelParams = new ModelParameters().setModel("/path/to/model.gguf");
InferenceParameters inferParams = new InferenceParameters("Tell me a joke.");

try (LlamaModel model = new LlamaModel(modelParams)) {
    // Stream a response and access more information about each output.
    for (LlamaOutput output : model.generate(inferParams)) {
        System.out.print(output);
    }
    // Calculate a whole response before returning it.
    String response = model.complete(inferParams);
    // Returns the hidden representation of the context + prompt.
    float[] embedding = model.embed("Embed this");
}
```

--------------------------------

### Java ModelParameters Configuration

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates the extensive configuration options available through the ModelParameters class for loading and initializing Llama models. This includes setting model sources, GPU offloading, threading, memory management, LoRA adapters, and sampling parameters.
```java
import de.kherud.llama.*;
import de.kherud.llama.args.*;

ModelParameters params = new ModelParameters()
        // Model source (choose one)
        .setModel("/path/to/model.gguf")
        // Or download from URL (requires -DLLAMA_CURL=ON during build)
        // .setModelUrl("https://huggingface.co/TheBloke/Mistral-7B-GGUF/resolve/main/mistral-7b.Q4_K_M.gguf")
        // Or from Hugging Face
        // .setHfRepo("TheBloke/Mistral-7B-GGUF")
        // .setHfFile("mistral-7b.Q4_K_M.gguf")

        // GPU and performance
        .setGpuLayers(35)       // Layers to offload to GPU (0 = CPU only)
        .setThreads(8)          // CPU threads for generation
        .setThreadsBatch(8)     // CPU threads for batch processing
        .setBatchSize(512)      // Logical batch size
        .setCtxSize(4096)       // Context window size

        // Memory and caching
        .enableFlashAttn()      // Enable Flash Attention
        .enableMlock()          // Lock model in RAM
        .setCacheTypeK(CacheType.F16)
        .setCacheTypeV(CacheType.F16)

        // LoRA adapters
        .addLoraAdapter("/path/to/lora-adapter.bin")
        .addLoraScaledAdapter("/path/to/another-lora.bin", 0.5f)

        // Sampling defaults
        .setTemp(0.8f)
        .setTopK(40)
        .setTopP(0.95f)
        .setRepeatPenalty(1.1f)

        // Special modes
        .enableEmbedding()      // Enable embedding generation
        .enableReranking()      // Enable document reranking

        // Logging
        .enableLogTimestamps()
        .enableLogPrefix();
```

--------------------------------

### Basic Java LlamaModel Usage

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates the core usage of the LlamaModel class for loading a model and performing both streaming and non-streaming text generation. It highlights the use of try-with-resources for proper memory management of native resources.
```java
import de.kherud.llama.*;
import de.kherud.llama.args.MiroStat;

public class BasicUsage {
    public static void main(String[] args) {
        // Configure and load the model
        ModelParameters modelParams = new ModelParameters()
                .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
                .setGpuLayers(43)   // Number of layers to offload to GPU
                .setCtxSize(2048);  // Context window size

        try (LlamaModel model = new LlamaModel(modelParams)) {
            // Streaming generation
            InferenceParameters inferParams = new InferenceParameters("Tell me a joke about programming.")
                    .setTemperature(0.7f)
                    .setTopP(0.9f)
                    .setNPredict(100);

            System.out.print("Response: ");
            for (LlamaOutput output : model.generate(inferParams)) {
                System.out.print(output);
            }
            System.out.println();

            // Non-streaming completion
            String response = model.complete(inferParams);
            System.out.println("Complete response: " + response);
        }
    }
}
```

--------------------------------

### Compile llama.cpp Java Bindings

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

These shell commands demonstrate the process of compiling the llama.cpp Java bindings. It includes running Maven to compile the Java code, followed by CMake commands to configure and build the native libraries, with an option to enable CUDA support.

```shell
mvn compile
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

--------------------------------

### Perform Code Infilling in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet demonstrates code infilling, where the model generates code to fill the gap between a given prefix and suffix. This is useful for code completion and insertion tasks. It utilizes `setInputPrefix` and `setInputSuffix` within `InferenceParameters`. Dependencies include the de.kherud.llama library and a suitable code model like codellama.
```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/codellama-7b.Q2_K.gguf")
        .setGpuLayers(43);

try (LlamaModel model = new LlamaModel(modelParams)) {
    // The prefix opens a Python docstring; the model fills in the gap
    String prefix = "def remove_non_ascii(s: str) -> str:\n    \"\"\" ";
    String suffix = "\n    return result\n";

    InferenceParameters params = new InferenceParameters("")
            .setInputPrefix(prefix)
            .setInputSuffix(suffix)
            .setTemperature(0.2f)
            .setNPredict(100)
            .setStopStrings("\"\"\"");  // Stop at the closing triple quote

    System.out.print(prefix);
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
    System.out.print(suffix);

    // Example of the intended result (actual model output will vary):
    // def remove_non_ascii(s: str) -> str:
    //     """ Remove non-ASCII characters from a string.
    //
    //     Args:
    //         s: Input string
    //
    //     Returns:
    //         String with only ASCII characters
    //     """
    //     result = ''.join(char for char in s if ord(char) < 128)
    //     return result
}
```

--------------------------------

### Format Chat Messages with Jinja Templates in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to use Jinja templating with the Llama.cpp Java library to format chat messages, including system prompts and conversation history. This ensures proper structure for the model's input. It requires enabling Jinja templating in ModelParameters and setting messages using `InferenceParameters.setMessages()`.
```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .enableJinja();  // Enable Jinja templating

try (LlamaModel model = new LlamaModel(modelParams)) {
    // Build conversation with system message and history
    String systemMessage = "You are a helpful coding assistant.";

    List<Pair<String, String>> messages = new ArrayList<>();
    messages.add(new Pair<>("user", "What is a Python list comprehension?"));
    messages.add(new Pair<>("assistant", "A list comprehension is a concise way to create lists in Python."));
    messages.add(new Pair<>("user", "Show me an example."));

    InferenceParameters params = new InferenceParameters("")
            .setMessages(systemMessage, messages)
            .setTemperature(0.7f)
            .setNPredict(200);

    // Apply chat template to see formatted prompt
    String formattedPrompt = model.applyTemplate(params);
    System.out.println("Formatted prompt:\n" + formattedPrompt);
    // Output shows proper chat formatting, e.g.:
    // <|im_start|>system
    // You are a helpful coding assistant.<|im_end|>
    // <|im_start|>user
    // What is a Python list comprehension?<|im_end|>
    // ...

    // Generate response
    System.out.println("\nResponse:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
}
```

--------------------------------

### Fetch json Dependency (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Fetches the nlohmann/json library from GitHub using FetchContent. This is a dependency for JSON handling within the project. It specifies the Git repository and a specific tag for version control.
```cmake
FetchContent_Declare(
    json
    GIT_REPOSITORY https://github.com/nlohmann/json
    GIT_TAG        v3.11.3
)
FetchContent_MakeAvailable(json)
```

--------------------------------

### Non-Streaming Text Completion with complete() in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Shows how to use the `complete()` method for non-streaming text generation, where the entire response is returned at once. This is useful when the full output is needed before proceeding. Requires the `de.kherud.llama.*` library.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf")
        .setGpuLayers(43);

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters(
            "Translate to French: Hello, how are you today?")
            .setTemperature(0.3f)
            .setNPredict(50)
            .setSeed(42);

    String response = model.complete(params);
    System.out.println("Translation: " + response);
    // Example output: Translation: Bonjour, comment allez-vous aujourd'hui?
}
```

--------------------------------

### Generate Text Embeddings in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Illustrates how to generate vector embeddings for text using embedding-capable models. This requires enabling embedding mode during model initialization and uses the `embed()` method. Requires the `de.kherud.llama.*` library.
```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/nomic-embed-text-v1.5.Q4_K_M.gguf")
        .enableEmbedding();  // Required for embedding generation

try (LlamaModel model = new LlamaModel(modelParams)) {
    String text = "Machine learning is a subset of artificial intelligence.";
    float[] embedding = model.embed(text);

    System.out.println("Embedding dimensions: " + embedding.length);
    System.out.println("First 5 values: ");
    for (int i = 0; i < Math.min(5, embedding.length); i++) {
        System.out.printf("  [%d]: %.6f%n", i, embedding[i]);
    }

    // Calculate cosine similarity between two embeddings
    float[] embedding2 = model.embed("AI includes machine learning and deep learning.");
    double similarity = cosineSimilarity(embedding, embedding2);
    System.out.printf("Cosine similarity: %.4f%n", similarity);
}

static double cosineSimilarity(float[] a, float[] b) {
    double dotProduct = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

--------------------------------

### Tokenize Text with encode() and decode() in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates direct access to the model's tokenizer for converting text to token IDs (`encode()`) and token IDs back to text (`decode()`). This is useful for pre-generation token counting. Requires the `de.kherud.llama.*` library.

```java
import de.kherud.llama.*;
import java.util.Arrays;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    String text = "Hello, world! How are you?";

    // Encode text to token IDs
    int[] tokens = model.encode(text);
    System.out.println("Original text: " + text);
    System.out.println("Token count: " + tokens.length);
    System.out.println("Token IDs: " + Arrays.toString(tokens));

    // Decode token IDs back to text
    String decoded = model.decode(tokens);
    System.out.println("Decoded text: " + decoded);
    // Note: Llama tokenizer adds a space prefix, so decoded may have leading space

    // Useful for counting tokens before generation
    String longPrompt = "This is a very long prompt...";
    int tokenCount = model.encode(longPrompt).length;
    System.out.println("Prompt uses " + tokenCount + " tokens");
}
```

--------------------------------

### Rerank Documents by Relevance in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Shows how to use the Llama.cpp Java library for document reranking. It allows you to score and sort documents based on their relevance to a given query. This feature requires enabling reranking in ModelParameters. The output provides raw relevance scores and a sorted list of documents.

```java
import de.kherud.llama.*;
import java.util.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/jina-reranker-v1-tiny-en-Q4_0.gguf")
        .setCtxSize(512)
        .enableReranking();  // Required for reranking

try (LlamaModel model = new LlamaModel(modelParams)) {
    String query = "Machine learning applications";
    String[] documents = {
            "A machine is a physical system that uses power to perform actions.",
            "Learning is the process of acquiring new knowledge and skills.",
            "Machine learning is a field of AI that enables computers to learn from data.",
            "Paris is the capital of France and a major European city."
    };

    // Get raw reranking scores
    LlamaOutput output = model.rerank(query, documents);
    System.out.println("Relevance scores:");
    for (Map.Entry<String, Float> entry : output.probabilities.entrySet()) {
        System.out.printf("  %.4f: %s...%n",
                entry.getValue(),
                entry.getKey().substring(0, Math.min(50, entry.getKey().length())));
    }

    // Get sorted results (most relevant first)
    List<Pair<String, Float>> rankedDocs = model.rerank(true, query, documents);
    System.out.println("\nRanked documents (best first):");
    for (int i = 0; i < rankedDocs.size(); i++) {
        Pair<String, Float> doc = rankedDocs.get(i);
        System.out.printf("%d. [%.4f] %s...%n",
                i + 1,
                doc.getValue(),
                doc.getKey().substring(0, Math.min(50, doc.getKey().length())));
    }
    // Example output (scores are illustrative):
    // 1. [0.9823] Machine learning is a field of AI that enables...
    // 2. [0.3421] Learning is the process of acquiring new know...
    // 3. [0.1234] A machine is a physical system that uses power...
    // 4. [0.0012] Paris is the capital of France and a major Eur...
}
```

--------------------------------

### Fetch llama.cpp Dependency (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Fetches the llama.cpp library from its GitHub repository using FetchContent. This is the core C++ library for llama model inference. It specifies the Git repository and pins the version with a specific Git tag.

```cmake
FetchContent_Declare(
    llama.cpp
    GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
    GIT_TAG        b4916
)
FetchContent_MakeAvailable(llama.cpp)
```

--------------------------------

### Build jllama Shared Library (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Defines and builds the 'jllama' shared library. It includes the main C++ source file and header files. It sets compiler features to C++11 and links against common, llama, and nlohmann_json libraries.
```cmake
add_library(jllama SHARED src/main/cpp/jllama.cpp src/main/cpp/server.hpp src/main/cpp/utils.hpp)

set_target_properties(jllama PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_include_directories(jllama PRIVATE src/main/cpp ${JNI_INCLUDE_DIRS})
target_link_libraries(jllama PRIVATE common llama nlohmann_json)
target_compile_features(jllama PRIVATE cxx_std_11)

target_compile_definitions(jllama PRIVATE
    SERVER_VERBOSE=$
)
```

--------------------------------

### Set Runtime Output Directory (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Configures the output directory for the built 'jllama' library based on the operating system. For Windows, it sets specific debug, release, and relwithdebinfo directories. For other OS, it sets a general library output directory.

```cmake
if(OS_NAME STREQUAL "Windows")
    set_target_properties(jllama llama ggml PROPERTIES
        RUNTIME_OUTPUT_DIRECTORY_DEBUG ${JLLAMA_DIR}
        RUNTIME_OUTPUT_DIRECTORY_RELEASE ${JLLAMA_DIR}
        RUNTIME_OUTPUT_DIRECTORY_RELWITHDEBINFO ${JLLAMA_DIR}
    )
else()
    set_target_properties(jllama llama ggml PROPERTIES
        LIBRARY_OUTPUT_DIRECTORY ${JLLAMA_DIR}
    )
endif()
```

--------------------------------

### Constrain Model Output with BNF-like Grammar in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

This snippet demonstrates how to constrain model output to follow a BNF-like grammar, ensuring structured output formats like JSON, mathematical expressions, or custom patterns. It utilizes the LlamaModel class and InferenceParameters to set the grammar and generate text accordingly. Dependencies include the de.kherud.llama library.
```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
        .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    // Grammar for simple arithmetic expressions.
    // "\\n" keeps a literal backslash-n in the grammar; a plain \n inside a
    // text block would be turned into a real line break by the Java compiler.
    String arithmeticGrammar = """
            root  ::= (expr "=" term "\\n")+
            expr  ::= term ([-+*/] term)*
            term  ::= [0-9]
            """;

    InferenceParameters params = new InferenceParameters(
            "Generate some arithmetic expressions:")
            .setGrammar(arithmeticGrammar)
            .setNPredict(50);

    System.out.println("Arithmetic expressions:");
    for (LlamaOutput output : model.generate(params)) {
        System.out.print(output);
    }
    // Output example:
    // 3+4=7
    // 9-2=7
    // 5*1=5

    // Grammar for yes/no answers only
    String yesNoGrammar = "root ::= (\"yes\" | \"no\")";
    InferenceParameters yesNoParams = new InferenceParameters(
            "Is the sky blue? Answer: ")
            .setGrammar(yesNoGrammar)
            .setNPredict(5);

    String answer = model.complete(yesNoParams);
    System.out.println("\nYes/No answer: " + answer);

    // Grammar for character sequences
    String abGrammar = "root ::= (\"a\" | \"b\")+";
    InferenceParameters abParams = new InferenceParameters("")
            .setGrammar(abGrammar)
            .setNPredict(20);

    String abSequence = model.complete(abParams);
    System.out.println("AB sequence: " + abSequence);  // e.g., "aababbaab"
}
```

--------------------------------

### Determine OS Architecture (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Determines the CPU architecture by executing a Java class. This is necessary for selecting the correct pre-compiled binaries or optimizing builds. It relies on 'mvn compile' having been executed previously.
```cmake if(NOT DEFINED OS_ARCH) find_package(Java REQUIRED) find_program(JAVA_EXECUTABLE NAMES java) execute_process( COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes de.kherud.llama.OSInfo --arch OUTPUT_VARIABLE OS_ARCH OUTPUT_STRIP_TRAILING_WHITESPACE ) endif() if(NOT OS_ARCH) message(FATAL_ERROR "Could not determine CPU architecture") endif() ``` -------------------------------- ### Copy Metal Shader (CMake) Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt Conditionally copies the 'ggml-metal.metal' shader file to the output directory when building with Metal support and not embedding the library. This ensures the Metal shader is available at runtime for macOS GPU acceleration. ```cmake if (LLAMA_METAL AND NOT LLAMA_METAL_EMBED_LIBRARY) configure_file(${llama.cpp_SOURCE_DIR}/ggml-metal.metal ${JLLAMA_DIR}/ggml-metal.metal COPYONLY) endif() ``` -------------------------------- ### Configure Android Gradle Build for java-llama.cpp Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md This Gradle configuration snippet integrates the java-llama.cpp library into your Android project. It includes steps to compile the library using Maven if necessary, declare C++ and Java sources, and configure CMake. ```gradle android { val jllamaLib = file("java-llama.cpp") // Execute "mvn compile" if folder target/ doesn't exist at ./java-llama.cpp/ if (!file("$jllamaLib/target").exists()) { exec { commandLine = listOf("mvn", "compile") workingDir = file("java-llama.cpp/") } } ... defaultConfig { ... 
        externalNativeBuild {
            cmake {
                // Add any flags if needed
                cppFlags += ""
                arguments += ""
            }
        }
    }

    // Declare C++ sources
    externalNativeBuild {
        cmake {
            path = file("$jllamaLib/CMakeLists.txt")
            version = "3.22.1"
        }
    }

    // Declare Java sources
    sourceSets {
        named("main") {
            // Add source directory for java-llama.cpp
            java.srcDir("$jllamaLib/src/main/java")
        }
    }
}
```

--------------------------------

### Cancel Llama Text Generation in Java

Source: https://context7.com/kherud/java-llama.cpp/llms.txt

Demonstrates how to cancel text generation early using the `LlamaIterator`. Generation can be stopped based on a token count or custom output patterns. Requires the `de.kherud.llama` package.

```java
import de.kherud.llama.*;

ModelParameters modelParams = new ModelParameters()
    .setModel("models/mistral-7b-instruct-v0.2.Q2_K.gguf");

try (LlamaModel model = new LlamaModel(modelParams)) {
    InferenceParameters params = new InferenceParameters("Write a very long story...")
        .setNPredict(1000);

    LlamaIterator iterator = model.generate(params).iterator();
    int tokenCount = 0;
    StringBuilder output = new StringBuilder();

    while (iterator.hasNext()) {
        LlamaOutput token = iterator.next();
        output.append(token);
        tokenCount++;

        // Cancel after 50 tokens or if we see a specific pattern
        if (tokenCount >= 50 || output.toString().contains("THE END")) {
            iterator.cancel();
            System.out.println("Generation cancelled after " + tokenCount + " tokens");
        }
    }

    System.out.println("Output: " + output);
}
```

--------------------------------

### Determine OS Name (CMake)

Source: https://github.com/kherud/java-llama.cpp/blob/master/CMakeLists.txt

Determines the operating system name by executing a Java class. This is crucial for platform-specific build configurations. It requires a prior 'mvn compile' to ensure the Java class is available.

```cmake
if(NOT DEFINED OS_NAME)
    find_package(Java REQUIRED)
    find_program(JAVA_EXECUTABLE NAMES java)
    execute_process(
        COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes de.kherud.llama.OSInfo --os
        OUTPUT_VARIABLE OS_NAME
        OUTPUT_STRIP_TRAILING_WHITESPACE
    )
endif()
if(NOT OS_NAME)
    message(FATAL_ERROR "Could not determine OS name")
endif()
```

--------------------------------

### Add java-llama.cpp as Git Submodule

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

Run from your Android project's app directory, this command adds the java-llama.cpp repository as a Git submodule. This allows you to manage the library's code separately while keeping it within your project.

```shell
git submodule add https://github.com/kherud/java-llama.cpp
```

--------------------------------

### Add llama.cpp Java Dependency via Maven

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This XML snippet shows how to add the llama.cpp Java bindings as a dependency to a Maven project. It specifies the group ID, artifact ID, and version required to include the library in your project's build path.

```xml
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>4.1.0</version>
</dependency>
```

--------------------------------

### Exclude java-llama.cpp from ProGuard

Source: https://github.com/kherud/java-llama.cpp/blob/master/README.md

This ProGuard rule ensures that the `de.kherud.llama` package and its contents are not stripped or obfuscated during the release build process. This is crucial for the library to function correctly after code shrinking.

```proguard
-keep class de.kherud.llama.** { *; }
```
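--------------------------------

### Sketch: Reporting OS Name and Architecture from Java

The CMake snippets that set `OS_NAME` and `OS_ARCH` run a Java class and read whatever it prints to standard output. The following is a minimal sketch of how such a probe could work using only the standard `os.name` and `os.arch` system properties. It is not the library's `de.kherud.llama.OSInfo` implementation: the class name `PlatformProbe`, both helper methods, and the returned labels are hypothetical, and the real class may normalize values differently.

```java
import java.util.Locale;

// Hypothetical stand-in for a platform probe; NOT the library's OSInfo class.
class PlatformProbe {

    // Maps a raw "os.name" value to a short label (labels are assumptions).
    static String normalizeOs(String osName) {
        String s = osName.toLowerCase(Locale.ROOT);
        // Check "darwin" before Windows, since "darwin" contains "win".
        if (s.contains("mac") || s.contains("darwin")) return "Mac";
        if (s.startsWith("windows")) return "Windows";
        if (s.contains("linux")) return "Linux";
        // Fallback: strip characters that are awkward in directory names.
        return osName.replaceAll("\\W", "");
    }

    // Maps a raw "os.arch" value to a short label (labels are assumptions).
    static String normalizeArch(String osArch) {
        String s = osArch.toLowerCase(Locale.ROOT);
        switch (s) {
            case "amd64":
            case "x86_64":
            case "x64":
                return "x86_64";
            case "aarch64":
            case "arm64":
                return "aarch64";
            case "x86":
            case "i386":
            case "i686":
                return "x86";
            default:
                return s.replaceAll("\\W", "");
        }
    }

    public static void main(String[] args) {
        // Mirrors the --os / --arch switches used by the CMake commands.
        boolean wantArch = args.length > 0 && args[0].equals("--arch");
        String value = wantArch
            ? normalizeArch(System.getProperty("os.arch"))
            : normalizeOs(System.getProperty("os.name"));
        // Print a single line so a caller can capture it as one variable.
        System.out.println(value);
    }
}
```

Printing exactly one line keeps the output easy to capture with `OUTPUT_VARIABLE` and `OUTPUT_STRIP_TRAILING_WHITESPACE`, as the CMake snippets do.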