# PromptTools

PromptTools is an open-source toolkit created by Hegel AI for testing, experimenting with, and evaluating Large Language Models (LLMs), vector databases, and prompts. It enables developers to systematically test prompts and parameters across different models, including OpenAI, Anthropic, Google Gemini, Mistral, LLaMA, and more, through familiar interfaces like Python code, Jupyter notebooks, and a local playground.

The library works by creating experiments that take the cartesian product of input arguments, executing each combination against LLM APIs or vector databases, and collecting the results into DataFrames for analysis. It provides built-in evaluation functions for semantic similarity, JSON validation, auto-evaluation using GPT-4, and more. Results can be exported to CSV, JSON, MongoDB, or LoRA-format JSON for fine-tuning, making it a complete solution for prompt engineering workflows.

## OpenAIChatExperiment

The `OpenAIChatExperiment` class runs experiments against OpenAI's chat completion API. It accepts lists of parameters and creates all possible combinations, executing each against the API and collecting responses with latency metrics.

```python
import os

from prompttools.experiment import OpenAIChatExperiment

os.environ["OPENAI_API_KEY"] = "your-api-key"

# Define experiment parameters - each should be a list
models = ["gpt-3.5-turbo", "gpt-4"]
messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
temperatures = [0.0, 1.0]

# Create and run the experiment
experiment = OpenAIChatExperiment(
    model=models,
    messages=messages,
    temperature=temperatures
)
experiment.run()

# View results in a table
experiment.visualize()

# Get results as a pandas DataFrame
df = experiment.get_table(get_all_cols=True)

# Export results to CSV
experiment.to_csv("results.csv")

# Export to JSON
experiment.to_json("results.json")
```
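Because every list argument is crossed with every other, the run count grows multiplicatively: the experiment above issues 2 models × 2 message lists × 2 temperatures = 8 API calls. A plain-Python sketch of that combinatorics (standard library only, independent of prompttools):

```python
from itertools import product

models = ["gpt-3.5-turbo", "gpt-4"]
message_lists = ["joke", "prime"]  # stand-ins for the two message lists above
temperatures = [0.0, 1.0]

# The experiment enumerates the cartesian product of all parameter lists
combinations = list(product(models, message_lists, temperatures))
print(len(combinations))  # 8
```

Keeping an eye on this count matters: adding a single extra temperature to the example above would raise it from 8 to 12 calls.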
## OpenAIChatExperiment with Function Calling

The experiment supports OpenAI's function calling feature, allowing you to test structured output generation with different functions and parameters.

```python
from prompttools.experiment import OpenAIChatExperiment

# Define functions for the model to use
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    [{"role": "user", "content": "What's the weather like in Boston?"}],
    [{"role": "user", "content": "Tell me the temperature in Tokyo."}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=messages,
    functions=[functions],
    function_call=[{"name": "get_weather"}],
    temperature=[0.0]
)
experiment.run()
experiment.visualize()
```

## OpenAIChatExperiment.initialize

An alternate way to initialize experiments by separating test parameters (varying) from frozen parameters (constant), without needing to wrap frozen values in lists.

```python
from prompttools.experiment import OpenAIChatExperiment

# Parameters to test - values are lists
test_parameters = {
    "model": ["gpt-3.5-turbo", "gpt-4"],
    "temperature": [0.0, 0.5, 1.0]
}

# Parameters to keep constant - values are NOT lists
messages = [{"role": "user", "content": "Who was the first president?"}]
frozen_parameters = {
    "top_p": 1.0,
    "messages": messages,
    "presence_penalty": 0.0
}

# Initialize and run
experiment = OpenAIChatExperiment.initialize(test_parameters, frozen_parameters)
experiment.run()
experiment.visualize()
```

## AnthropicCompletionExperiment

Test prompts against Anthropic's Claude models with configurable parameters for temperature, token limits, and sampling strategies.

```python
import os

from prompttools.experiment import AnthropicCompletionExperiment
from anthropic import HUMAN_PROMPT, AI_PROMPT

os.environ["ANTHROPIC_API_KEY"] = "your-api-key"

# Format prompts using Anthropic's required format
prompts = [
    f"{HUMAN_PROMPT} What is the capital of France? {AI_PROMPT}",
    f"{HUMAN_PROMPT} Explain quantum computing in simple terms. {AI_PROMPT}",
]

experiment = AnthropicCompletionExperiment(
    model=["claude-2", "claude-instant-1"],
    prompt=prompts,
    max_tokens_to_sample=[500, 1000],
    temperature=[0.0, 0.7]
)
experiment.run()
experiment.visualize()

# Get results as DataFrame
df = experiment.get_table()
print(df[["model", "prompt", "response", "latency"]])
```

## GoogleGeminiChatCompletionExperiment

Run experiments against Google's Gemini models using the Google Generative AI SDK with configurable generation and safety settings.

```python
import os

import google.generativeai as genai

from prompttools.experiment import GoogleGeminiChatCompletionExperiment

# Configure API key
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Define prompts
contents = [
    "Explain machine learning to a 10-year-old",
    "Write a haiku about programming",
]

# Optional generation config
generation_config = genai.types.GenerationConfig(
    temperature=0.7,
    max_output_tokens=256
)

experiment = GoogleGeminiChatCompletionExperiment(
    model=["gemini-pro"],
    contents=contents,
    generation_config=[generation_config, None]  # Test with and without config
)
experiment.run()
experiment.visualize()
```

## MistralChatCompletionExperiment

Execute experiments against Mistral AI's chat completion API with support for safety prompts and deterministic generation using random seeds.

```python
import os

from prompttools.experiment import MistralChatCompletionExperiment
from mistralai.models.chat_completion import ChatMessage

os.environ["MISTRAL_API_KEY"] = "your-api-key"

# Create messages using Mistral's ChatMessage format
messages = [
    [ChatMessage(role="user", content="What is the meaning of life?")],
    [ChatMessage(role="user", content="Explain recursion with an example.")],
]

experiment = MistralChatCompletionExperiment(
    model=["mistral-tiny", "mistral-small"],
    messages=messages,
    temperature=[0.3, 0.7],
    max_tokens=[500],
    safe_prompt=[True]  # Enable safety filtering
)
experiment.run()
experiment.visualize()
```
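Since every experiment exposes its results as a pandas DataFrame, outputs from different providers can be lined up with ordinary pandas operations; no prompttools API beyond `get_table()` is involved. A minimal sketch with toy stand-ins for two experiments' tables (the DataFrame contents here are illustrative, not real results):

```python
import pandas as pd

# Toy stand-ins for two experiments' get_table() outputs (illustrative only)
anthropic_df = pd.DataFrame({"response": ["Paris."], "latency": [1.2]})
mistral_df = pd.DataFrame({"response": ["42."], "latency": [0.8]})

# Tag each table with its provider and stack them for side-by-side comparison
combined = pd.concat(
    [anthropic_df.assign(provider="anthropic"),
     mistral_df.assign(provider="mistral")],
    ignore_index=True,
)
print(combined.groupby("provider")["latency"].mean())
```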
## LlamaCppExperiment

Test local LLaMA models via llama.cpp with full control over model initialization and inference parameters.

```python
from prompttools.experiment import LlamaCppExperiment

# Define model and inference parameters
model_paths = [
    "/path/to/llama-2-7b.gguf",
    "/path/to/llama-2-13b.gguf"
]
prompts = [
    "Write a short poem about coding:",
    "Explain why the sky is blue:",
]

# Model initialization parameters
model_params = {
    "n_ctx": [2048],   # Context window size
    "n_threads": [8],  # CPU threads to use
}

# Inference parameters
call_params = {
    "max_tokens": [256],
    "temperature": [0.7, 1.0],
    "top_p": [0.9],
    "repeat_penalty": [1.1]
}

experiment = LlamaCppExperiment(
    model_path=model_paths,
    prompt=prompts,
    model_params=model_params,
    call_params=call_params
)
experiment.run()
experiment.visualize()
```

## ChromaDBExperiment

Test vector database retrieval with different embedding functions and query parameters using ChromaDB.

```python
import chromadb

from prompttools.experiment import ChromaDBExperiment

# Initialize ChromaDB client
chroma_client = chromadb.Client()

# Documents to add to the collection
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multiple layers of neural networks.",
]

# Query parameters to test
query_params = {
    "query_texts": [
        ["What is machine learning?"],
        ["Tell me about Python programming"],
    ],
    "n_results": [2, 3]  # Number of results to retrieve
}

experiment = ChromaDBExperiment(
    chroma_client=chroma_client,
    collection_name="test_collection",
    use_existing_collection=False,  # Create a new collection
    query_collection_params=query_params,
    add_to_collection_params={
        "documents": documents,
        "ids": [f"doc_{i}" for i in range(len(documents))]
    }
)
experiment.run()

# View retrieved documents and distances
experiment.visualize()
df = experiment.get_table()
print(df[["query_texts", "top doc ids", "distances", "documents"]])
```

## Experiment.evaluate

Add custom evaluation metrics to experiment results using the `evaluate` method. Built-in evaluation functions include semantic similarity, JSON validation, and auto-evaluation with GPT-4.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import (
    semantic_similarity,
    validate_json_response,
    autoeval_binary_scoring
)

# Run experiment
experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=[[{"role": "user", "content": "What is 2+2? Answer with just the number."}]],
    temperature=[0.0]
)
experiment.run()

# Evaluate with semantic similarity against expected responses
expected_responses = ["4"]  # One expected response per experiment row
experiment.evaluate(
    "similarity_score",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected_responses
)

# Validate JSON format (returns 1.0 if valid, 0.0 if not)
experiment.evaluate(
    "is_valid_json",
    validate_json_response,
    static_eval_fn_kwargs={"response_column_name": "response"}
)

# Auto-evaluate with GPT-4 (returns 1.0 if the response follows directions)
experiment.evaluate(
    "follows_directions",
    autoeval_binary_scoring,
    static_eval_fn_kwargs={
        "prompt_column_name": "messages",
        "response_column_name": "response"
    }
)

# View results with evaluation scores
experiment.visualize()
```
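Beyond the built-ins, `evaluate` can take any callable you write. A minimal sketch of a custom metric, under the assumption (suggested by the `response_column_name` arguments above, but not confirmed by this document) that the evaluation function receives each result row as a pandas Series plus the keyword arguments you pass in:

```python
import pandas as pd

# Hypothetical custom metric: 1.0 if the response stays under a word budget.
# Assumes evaluate() calls the function once per result row (a pandas Series),
# mirroring how the built-in utilities are invoked above.
def is_concise(row: pd.Series, response_column_name: str = "response",
               max_words: int = 50) -> float:
    return 1.0 if len(row[response_column_name].split()) <= max_words else 0.0

experiment.evaluate(
    "is_concise",
    is_concise,
    static_eval_fn_kwargs={"response_column_name": "response", "max_words": 50}
)
```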
## Experiment.aggregate and rank

Aggregate and rank experiment results by specific metrics to identify the best-performing models, prompts, or configurations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

# Run experiment with multiple models and temperatures
experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=[
        [{"role": "user", "content": "Explain gravity in one sentence."}],
    ],
    temperature=[0.0, 0.5]
)
experiment.run()

# Add evaluation metric: one expected response per result row
# (2 models x 1 message x 2 temperatures = 4 rows)
expected = ["Gravity is the force of attraction between masses."] * 4
experiment.evaluate(
    "similarity",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected
)

# Aggregate latency by model (compute average latency per model)
experiment.aggregate(
    metric_name="latency",
    column_name="model",
    is_average=True
)

# Rank models by similarity score
rankings = experiment.rank(
    metric_name="similarity",
    is_average=True,
    agg_column="model"
)
print("Model rankings by similarity:", rankings)
# Example output: {'gpt-4': 0.92, 'gpt-3.5-turbo': 0.85}
```

## ChatPromptTemplateExperimentationHarness

Use Jinja2 templates to test different prompt structures with variable user inputs, automatically generating all combinations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.harness import ChatPromptTemplateExperimentationHarness

# Define message templates using Jinja2 syntax
templates = [
    # Template 1: Direct question
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{{question}}"}
    ],
    # Template 2: With context instruction
    [
        {"role": "system", "content": "You are an expert. Be concise and accurate."},
        {"role": "user", "content": "Please answer: {{question}}"}
    ],
]

# User inputs to test with each template
user_inputs = [
    {"question": "What is photosynthesis?"},
    {"question": "How do computers work?"},
]

# Create harness
harness = ChatPromptTemplateExperimentationHarness(
    experiment=OpenAIChatExperiment,
    model_name="gpt-3.5-turbo",
    message_templates=templates,
    user_inputs=user_inputs,
    model_arguments={"temperature": 0.0}  # Frozen parameters
)

# Run and visualize
harness.run()
harness.visualize()

# Aggregate latency by template
latency_by_template = harness.aggregate(
    groupby_column="templates",
    aggregate_columns="latency",
    method="mean"
)
print(latency_by_template)
```
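The harness pairs every template with every user input (2 × 2 = 4 runs in the example above), and the substitution itself is ordinary Jinja2 rendering. A standalone illustration of what one rendered user message looks like, using plain `jinja2` outside the harness:

```python
from jinja2 import Template

# Render the second template's user message with one of the inputs above
template = Template("Please answer: {{question}}")
print(template.render(question="What is photosynthesis?"))
# Please answer: What is photosynthesis?
```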
## ExperimentationHarness.evaluate

The harness provides the same evaluation capabilities as experiments, allowing systematic evaluation across template variations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.harness import ChatPromptTemplateExperimentationHarness
from prompttools.utils import semantic_similarity

templates = [
    [
        {"role": "system", "content": "Answer questions briefly."},
        {"role": "user", "content": "{{query}}"}
    ],
    [
        {"role": "system", "content": "You are a teacher. Explain clearly."},
        {"role": "user", "content": "Student asks: {{query}}"}
    ],
]

user_inputs = [
    {"query": "What is DNA?"},
    {"query": "Why is the ocean salty?"},
]

harness = ChatPromptTemplateExperimentationHarness(
    experiment=OpenAIChatExperiment,
    model_name="gpt-3.5-turbo",
    message_templates=templates,
    user_inputs=user_inputs
)
harness.run()

# Define expected responses for evaluation: one per result row
# (2 templates x 2 user inputs = 4 rows)
expected_responses = [
    "DNA is a molecule containing genetic instructions.",
    "DNA is the molecule that carries genetic information.",
    "Ocean is salty due to dissolved minerals from rocks.",
    "The ocean is salty because of minerals from rivers and rocks."
]

# Evaluate responses
harness.evaluate(
    "quality",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected_responses
)
harness.visualize()
```

## prompttest Decorator

Create automated test suites for prompts using the `@prompttest` decorator, enabling CI/CD integration for prompt quality assurance.

```python
import openai

from prompttools import prompttest
from prompttools.utils import semantic_similarity
from prompttools.prompttest.threshold_type import ThresholdType

# Define the completion function to test
def my_completion_fn(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Define test prompts and expected responses
test_prompts = [
    "What is 2 + 2?",
    "What is the capital of France?",
]
expected_responses = [
    "4",
    "Paris",
]

# Create a prompt test with semantic similarity evaluation
@prompttest(
    metric_name="similarity",
    eval_fn=semantic_similarity,
    prompts=test_prompts,
    threshold=0.7,  # Minimum similarity score
    threshold_type=ThresholdType.MINIMUM,
    expected=expected_responses
)
def test_basic_qa(prompt: str) -> str:
    return my_completion_fn(prompt)

# Run tests (typically in a test file)
if __name__ == "__main__":
    from prompttools.prompttest import main
    main()
    # Output: Running 1 test(s)
    # Output: All 1 test(s) passed!
```

## validate_json_response

Validate that model responses are properly formatted JSON, useful for testing structured output from function calling or JSON mode.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import validate_json_response

# Test JSON generation capability
messages = [
    [{"role": "user", "content": "Return a JSON object with keys 'name' and 'age' for a person named John who is 30."}],
    [{"role": "user", "content": "Create a JSON array of 3 colors."}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=messages,
    response_format=[{"type": "json_object"}],  # Enable JSON mode
    temperature=[0.0]
)
experiment.run()

# Validate JSON format
experiment.evaluate(
    "is_valid_json",
    validate_json_response,
    static_eval_fn_kwargs={"response_column_name": "response"}
)

# View results - is_valid_json will be 1.0 for valid JSON, 0.0 otherwise
df = experiment.get_table()
print(df[["model", "response", "is_valid_json"]])
```
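Conceptually, the score reduces to "does the response parse as JSON?". A plain-Python illustration of that idea (not the library's implementation):

```python
import json

def json_validity_score(text: str) -> float:
    """Return 1.0 if text parses as JSON, else 0.0 (illustrative only)."""
    try:
        json.loads(text)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

print(json_validity_score('{"name": "John", "age": 30}'))  # 1.0
print(json_validity_score("three colors"))                 # 0.0
```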
## autoeval_binary_scoring

Use GPT-4 as an automated judge to evaluate whether model responses follow the given directions, returning binary scores.

```python
import os

from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import autoeval_binary_scoring

os.environ["OPENAI_API_KEY"] = "your-api-key"

# Test instruction following
messages = [
    [{"role": "user", "content": "List exactly 3 fruits, one per line."}],
    [{"role": "user", "content": "Write a haiku about winter."}],
    [{"role": "user", "content": "Respond with only 'yes' or 'no': Is water wet?"}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=messages,
    temperature=[0.0, 0.7]
)
experiment.run()

# Auto-evaluate with GPT-4 as judge
experiment.evaluate(
    "follows_instructions",
    autoeval_binary_scoring,
    static_eval_fn_kwargs={
        "prompt_column_name": "messages",
        "response_column_name": "response"
    }
)

# View results with instruction-following scores
experiment.visualize()

# Aggregate by temperature to see whether lower temperatures follow instructions better
df = experiment.get_table()
print(df.groupby("temperature")["follows_instructions"].mean())
```

## Experiment Export Methods

Export experiment results in various formats for persistence, analysis, or fine-tuning data preparation.

```python
from prompttools.experiment import OpenAIChatExperiment

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=[
        [{"role": "user", "content": "Translate 'hello' to French."}],
        [{"role": "user", "content": "Translate 'goodbye' to Spanish."}],
    ],
    temperature=[0.0]
)
experiment.run()

# Export to CSV
experiment.to_csv("results.csv", index=False)

# Export to JSON
json_str = experiment.to_json()     # Returns a JSON string
experiment.to_json("results.json")  # Saves to file

# Export to pandas DataFrame
df = experiment.to_pandas_df(get_all_cols=True)

# Export to LoRA format for fine-tuning
experiment.to_lora_json(
    instruction_extract=lambda row: "Translate the following text",
    input_extract=lambda row: str(row["messages"][-1]["content"]),
    output_extract="response",
    path="finetune_data.json"
)

# Export to MongoDB
experiment.to_mongo_db(
    mongo_uri="mongodb://localhost:27017/",
    database_name="experiments",
    collection_name="translation_tests"
)

# Export to Markdown
markdown_table = experiment.to_markdown()
print(markdown_table)
```
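The LoRA export targets instruction-tuning pipelines. Based on the three extractor arguments above, each result row should map to a record shaped roughly as follows; this shape is an assumption for illustration, not an official schema:

```python
# Illustrative record produced for the first row of the experiment above;
# the "output" value is a hypothetical model response.
record = {
    "instruction": "Translate the following text",
    "input": "Translate 'hello' to French.",
    "output": "Bonjour",
}
```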
## Azure OpenAI Service Integration

Run experiments against Azure OpenAI Service deployments with custom endpoint configuration.

```python
import os

from prompttools.experiment import OpenAIChatExperiment

# Set Azure credentials
os.environ["AZURE_OPENAI_KEY"] = "your-azure-key"

# Azure configuration
azure_config = {
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com/",
    "API_VERSION": "2023-12-01-preview"
}

# Use deployment names instead of model names
experiment = OpenAIChatExperiment(
    model=["gpt-35-turbo-deployment", "gpt-4-deployment"],  # Deployment names
    messages=[
        [{"role": "user", "content": "Summarize the benefits of cloud computing."}],
    ],
    temperature=[0.0, 0.5],
    azure_openai_service_configs=azure_config
)
experiment.run()
experiment.visualize()
```

## Summary

PromptTools excels at systematic prompt engineering by enabling developers to test multiple models, prompts, and parameters simultaneously. The core workflow involves creating experiments with parameter lists, running them to generate all combinations, evaluating results with built-in or custom metrics, and exporting data for analysis or fine-tuning. Key use cases include A/B testing prompt templates, comparing model performance, validating structured outputs, regression testing prompt changes, and preparing fine-tuning datasets from successful responses.

The library integrates with major LLM providers (OpenAI, Anthropic, Google, Mistral), local models (LLaMA via llama.cpp), and vector databases (ChromaDB, Pinecone, Weaviate, Qdrant). Harnesses provide higher-level abstractions for common patterns like template testing and model comparison, while the `@prompttest` decorator enables CI/CD integration for continuous prompt quality assurance. Results are stored in pandas DataFrames, making it easy to leverage Python's data science ecosystem for deeper analysis and visualization.