# PromptTools

PromptTools is an open-source toolkit created by Hegel AI for testing, experimenting with, and evaluating Large Language Models (LLMs), vector databases, and prompts. It enables developers to systematically test prompts and parameters across different models, including OpenAI, Anthropic, Google Gemini, Mistral, LLaMA, and more, through familiar interfaces like Python code, Jupyter notebooks, and a local playground.

The library works by creating experiments that take the cartesian product of input arguments, executing each combination against LLM APIs or vector databases, and collecting the results into DataFrames for analysis. It provides built-in evaluation functions for semantic similarity, JSON validation, auto-evaluation using GPT-4, and more. Results can be exported to CSV, JSON, MongoDB, or LoRA-format JSON for fine-tuning, making it a complete solution for prompt engineering workflows.

## OpenAIChatExperiment

The `OpenAIChatExperiment` class runs experiments against OpenAI's chat completion API. It accepts lists of parameters and creates all possible combinations, executing each against the API and collecting responses with latency metrics.

```python
import os

from prompttools.experiment import OpenAIChatExperiment

os.environ["OPENAI_API_KEY"] = "your-api-key"

# Define experiment parameters - each should be a list
models = ["gpt-3.5-turbo", "gpt-4"]
messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
temperatures = [0.0, 1.0]

# Create and run the experiment
experiment = OpenAIChatExperiment(
    model=models,
    messages=messages,
    temperature=temperatures
)
experiment.run()

# View results in a table
experiment.visualize()

# Get results as a pandas DataFrame
df = experiment.get_table(get_all_cols=True)

# Export results to CSV
experiment.to_csv("results.csv")

# Export to JSON
experiment.to_json("results.json")
```
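Because every list argument is crossed with every other, the run count grows multiplicatively: the experiment above issues 2 models × 2 message lists × 2 temperatures = 8 API calls. A plain-Python sketch of that combinatorics (standard library only, independent of prompttools):

```python
from itertools import product

models = ["gpt-3.5-turbo", "gpt-4"]
message_lists = ["joke", "prime"]  # stand-ins for the two message lists above
temperatures = [0.0, 1.0]

# The experiment enumerates the cartesian product of all parameter lists
combinations = list(product(models, message_lists, temperatures))
print(len(combinations))  # 8
```

Keeping an eye on this count matters: adding a single extra temperature to the example above would raise it from 8 to 12 calls.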
## OpenAIChatExperiment with Function Calling

The experiment supports OpenAI's function calling feature, allowing you to test structured output generation with different functions and parameters.

```python
from prompttools.experiment import OpenAIChatExperiment

# Define functions for the model to use
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    [{"role": "user", "content": "What's the weather like in Boston?"}],
    [{"role": "user", "content": "Tell me the temperature in Tokyo."}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=messages,
    functions=[functions],
    function_call=[{"name": "get_weather"}],
    temperature=[0.0]
)
experiment.run()
experiment.visualize()
```

## OpenAIChatExperiment.initialize

An alternate way to initialize experiments by separating test parameters (varying) from frozen parameters (constant), without needing to wrap frozen values in lists.

```python
from prompttools.experiment import OpenAIChatExperiment

# Parameters to test - values are lists
test_parameters = {
    "model": ["gpt-3.5-turbo", "gpt-4"],
    "temperature": [0.0, 0.5, 1.0]
}

# Parameters to keep constant - values are NOT lists
messages = [{"role": "user", "content": "Who was the first president?"}]
frozen_parameters = {
    "top_p": 1.0,
    "messages": messages,
    "presence_penalty": 0.0
}

# Initialize and run
experiment = OpenAIChatExperiment.initialize(test_parameters, frozen_parameters)
experiment.run()
experiment.visualize()
```

## AnthropicCompletionExperiment

Test prompts against Anthropic's Claude models with configurable parameters for temperature, token limits, and sampling strategies.

```python
import os

from prompttools.experiment import AnthropicCompletionExperiment
from anthropic import HUMAN_PROMPT, AI_PROMPT

os.environ["ANTHROPIC_API_KEY"] = "your-api-key"

# Format prompts using Anthropic's required format
prompts = [
    f"{HUMAN_PROMPT} What is the capital of France? {AI_PROMPT}",
    f"{HUMAN_PROMPT} Explain quantum computing in simple terms. {AI_PROMPT}",
]

experiment = AnthropicCompletionExperiment(
    model=["claude-2", "claude-instant-1"],
    prompt=prompts,
    max_tokens_to_sample=[500, 1000],
    temperature=[0.0, 0.7]
)
experiment.run()
experiment.visualize()

# Get results as DataFrame
df = experiment.get_table()
print(df[["model", "prompt", "response", "latency"]])
```

## GoogleGeminiChatCompletionExperiment

Run experiments against Google's Gemini models using the Google Generative AI SDK with configurable generation and safety settings.

```python
import os

import google.generativeai as genai

from prompttools.experiment import GoogleGeminiChatCompletionExperiment

# Configure API key
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Define prompts
contents = [
    "Explain machine learning to a 10-year-old",
    "Write a haiku about programming",
]

# Optional generation config
generation_config = genai.types.GenerationConfig(
    temperature=0.7,
    max_output_tokens=256
)

experiment = GoogleGeminiChatCompletionExperiment(
    model=["gemini-pro"],
    contents=contents,
    generation_config=[generation_config, None]  # Test with and without config
)
experiment.run()
experiment.visualize()
```

## MistralChatCompletionExperiment

Execute experiments against Mistral AI's chat completion API with support for safety prompts and deterministic generation using random seeds.

```python
import os

from prompttools.experiment import MistralChatCompletionExperiment
from mistralai.models.chat_completion import ChatMessage

os.environ["MISTRAL_API_KEY"] = "your-api-key"

# Create messages using Mistral's ChatMessage format
messages = [
    [ChatMessage(role="user", content="What is the meaning of life?")],
    [ChatMessage(role="user", content="Explain recursion with an example.")],
]

experiment = MistralChatCompletionExperiment(
    model=["mistral-tiny", "mistral-small"],
    messages=messages,
    temperature=[0.3, 0.7],
    max_tokens=[500],
    safe_prompt=[True]  # Enable safety filtering
)
experiment.run()
experiment.visualize()
```
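Since every experiment exposes its results as a pandas DataFrame, outputs from different providers can be lined up with ordinary pandas operations; no prompttools API beyond `get_table()` is involved. A minimal sketch with toy stand-ins for two experiments' tables (the DataFrame contents here are illustrative, not real results):

```python
import pandas as pd

# Toy stand-ins for two experiments' get_table() outputs (illustrative only)
anthropic_df = pd.DataFrame({"response": ["Paris."], "latency": [1.2]})
mistral_df = pd.DataFrame({"response": ["42."], "latency": [0.8]})

# Tag each table with its provider and stack them for side-by-side comparison
combined = pd.concat(
    [anthropic_df.assign(provider="anthropic"),
     mistral_df.assign(provider="mistral")],
    ignore_index=True,
)
print(combined.groupby("provider")["latency"].mean())
```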
## LlamaCppExperiment

Test local LLaMA models via llama.cpp with full control over model initialization and inference parameters.

```python
from prompttools.experiment import LlamaCppExperiment

# Define model and inference parameters
model_paths = [
    "/path/to/llama-2-7b.gguf",
    "/path/to/llama-2-13b.gguf"
]
prompts = [
    "Write a short poem about coding:",
    "Explain why the sky is blue:",
]

# Model initialization parameters
model_params = {
    "n_ctx": [2048],   # Context window size
    "n_threads": [8],  # CPU threads to use
}

# Inference parameters
call_params = {
    "max_tokens": [256],
    "temperature": [0.7, 1.0],
    "top_p": [0.9],
    "repeat_penalty": [1.1]
}

experiment = LlamaCppExperiment(
    model_path=model_paths,
    prompt=prompts,
    model_params=model_params,
    call_params=call_params
)
experiment.run()
experiment.visualize()
```

## ChromaDBExperiment

Test vector database retrieval with different embedding functions and query parameters using ChromaDB.

```python
import chromadb

from prompttools.experiment import ChromaDBExperiment

# Initialize ChromaDB client
chroma_client = chromadb.Client()

# Documents to add to the collection
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multiple layers of neural networks.",
]

# Query parameters to test
query_params = {
    "query_texts": [
        ["What is machine learning?"],
        ["Tell me about Python programming"],
    ],
    "n_results": [2, 3]  # Number of results to retrieve
}

experiment = ChromaDBExperiment(
    chroma_client=chroma_client,
    collection_name="test_collection",
    use_existing_collection=False,  # Create a new collection
    query_collection_params=query_params,
    add_to_collection_params={
        "documents": documents,
        "ids": [f"doc_{i}" for i in range(len(documents))]
    }
)
experiment.run()

# View retrieved documents and distances
experiment.visualize()
df = experiment.get_table()
print(df[["query_texts", "top doc ids", "distances", "documents"]])
```

## Experiment.evaluate

Add custom evaluation metrics to experiment results using the `evaluate` method. Built-in evaluation functions include semantic similarity, JSON validation, and auto-evaluation with GPT-4.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import (
    semantic_similarity,
    validate_json_response,
    autoeval_binary_scoring
)

# Run experiment
experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=[[{"role": "user", "content": "What is 2+2? Answer with just the number."}]],
    temperature=[0.0]
)
experiment.run()

# Evaluate with semantic similarity against expected responses
expected_responses = ["4"]  # One expected response per experiment row
experiment.evaluate(
    "similarity_score",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected_responses
)

# Validate JSON format (returns 1.0 if valid, 0.0 if not)
experiment.evaluate(
    "is_valid_json",
    validate_json_response,
    static_eval_fn_kwargs={"response_column_name": "response"}
)

# Auto-evaluate with GPT-4 (returns 1.0 if the response follows directions)
experiment.evaluate(
    "follows_directions",
    autoeval_binary_scoring,
    static_eval_fn_kwargs={
        "prompt_column_name": "messages",
        "response_column_name": "response"
    }
)

# View results with evaluation scores
experiment.visualize()
```
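Beyond the built-ins, `evaluate` can take any callable you write. A minimal sketch of a custom metric, under the assumption (suggested by the `response_column_name` arguments above, but not confirmed by this document) that the evaluation function receives each result row as a pandas Series plus the keyword arguments you pass in:

```python
import pandas as pd

# Hypothetical custom metric: 1.0 if the response stays under a word budget.
# Assumes evaluate() calls the function once per result row (a pandas Series),
# mirroring how the built-in utilities are invoked above.
def is_concise(row: pd.Series, response_column_name: str = "response",
               max_words: int = 50) -> float:
    return 1.0 if len(row[response_column_name].split()) <= max_words else 0.0

experiment.evaluate(
    "is_concise",
    is_concise,
    static_eval_fn_kwargs={"response_column_name": "response", "max_words": 50}
)
```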
## Experiment.aggregate and rank

Aggregate and rank experiment results by specific metrics to identify the best-performing models, prompts, or configurations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

# Run experiment with multiple models and temperatures
experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=[
        [{"role": "user", "content": "Explain gravity in one sentence."}],
    ],
    temperature=[0.0, 0.5]
)
experiment.run()

# Add evaluation metric: one expected response per result row
# (2 models x 1 message x 2 temperatures = 4 rows)
expected = ["Gravity is the force of attraction between masses."] * 4
experiment.evaluate(
    "similarity",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected
)

# Aggregate latency by model (compute average latency per model)
experiment.aggregate(
    metric_name="latency",
    column_name="model",
    is_average=True
)

# Rank models by similarity score
rankings = experiment.rank(
    metric_name="similarity",
    is_average=True,
    agg_column="model"
)
print("Model rankings by similarity:", rankings)
# Example output: {'gpt-4': 0.92, 'gpt-3.5-turbo': 0.85}
```

## ChatPromptTemplateExperimentationHarness

Use Jinja2 templates to test different prompt structures with variable user inputs, automatically generating all combinations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.harness import ChatPromptTemplateExperimentationHarness

# Define message templates using Jinja2 syntax
templates = [
    # Template 1: Direct question
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{{question}}"}
    ],
    # Template 2: With context instruction
    [
        {"role": "system", "content": "You are an expert. Be concise and accurate."},
        {"role": "user", "content": "Please answer: {{question}}"}
    ],
]

# User inputs to test with each template
user_inputs = [
    {"question": "What is photosynthesis?"},
    {"question": "How do computers work?"},
]

# Create harness
harness = ChatPromptTemplateExperimentationHarness(
    experiment=OpenAIChatExperiment,
    model_name="gpt-3.5-turbo",
    message_templates=templates,
    user_inputs=user_inputs,
    model_arguments={"temperature": 0.0}  # Frozen parameters
)

# Run and visualize
harness.run()
harness.visualize()

# Aggregate latency by template
latency_by_template = harness.aggregate(
    groupby_column="templates",
    aggregate_columns="latency",
    method="mean"
)
print(latency_by_template)
```
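The harness pairs every template with every user input (2 × 2 = 4 runs in the example above), and the substitution itself is ordinary Jinja2 rendering. A standalone illustration of what one rendered user message looks like, using plain `jinja2` outside the harness:

```python
from jinja2 import Template

# Render the second template's user message with one of the inputs above
template = Template("Please answer: {{question}}")
print(template.render(question="What is photosynthesis?"))
# Please answer: What is photosynthesis?
```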
## ExperimentationHarness.evaluate

The harness provides the same evaluation capabilities as experiments, allowing systematic evaluation across template variations.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.harness import ChatPromptTemplateExperimentationHarness
from prompttools.utils import semantic_similarity

templates = [
    [
        {"role": "system", "content": "Answer questions briefly."},
        {"role": "user", "content": "{{query}}"}
    ],
    [
        {"role": "system", "content": "You are a teacher. Explain clearly."},
        {"role": "user", "content": "Student asks: {{query}}"}
    ],
]

user_inputs = [
    {"query": "What is DNA?"},
    {"query": "Why is the ocean salty?"},
]

harness = ChatPromptTemplateExperimentationHarness(
    experiment=OpenAIChatExperiment,
    model_name="gpt-3.5-turbo",
    message_templates=templates,
    user_inputs=user_inputs
)
harness.run()

# Define expected responses for evaluation: one per result row
# (2 templates x 2 user inputs = 4 rows)
expected_responses = [
    "DNA is a molecule containing genetic instructions.",
    "DNA is the molecule that carries genetic information.",
    "Ocean is salty due to dissolved minerals from rocks.",
    "The ocean is salty because of minerals from rivers and rocks."
]

# Evaluate responses
harness.evaluate(
    "quality",
    semantic_similarity,
    static_eval_fn_kwargs={"response_column_name": "response"},
    expected=expected_responses
)
harness.visualize()
```

## prompttest Decorator

Create automated test suites for prompts using the `@prompttest` decorator, enabling CI/CD integration for prompt quality assurance.

```python
import openai

from prompttools import prompttest
from prompttools.utils import semantic_similarity
from prompttools.prompttest.threshold_type import ThresholdType

# Define the completion function to test
def my_completion_fn(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Define test prompts and expected responses
test_prompts = [
    "What is 2 + 2?",
    "What is the capital of France?",
]
expected_responses = [
    "4",
    "Paris",
]

# Create a prompt test with semantic similarity evaluation
@prompttest(
    metric_name="similarity",
    eval_fn=semantic_similarity,
    prompts=test_prompts,
    threshold=0.7,  # Minimum similarity score
    threshold_type=ThresholdType.MINIMUM,
    expected=expected_responses
)
def test_basic_qa(prompt: str) -> str:
    return my_completion_fn(prompt)

# Run tests (typically in a test file)
if __name__ == "__main__":
    from prompttools.prompttest import main
    main()
    # Output: Running 1 test(s)
    # Output: All 1 test(s) passed!
```

## validate_json_response

Validate that model responses are properly formatted JSON, useful for testing structured output from function calling or JSON mode.

```python
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import validate_json_response

# Test JSON generation capability
messages = [
    [{"role": "user", "content": "Return a JSON object with keys 'name' and 'age' for a person named John who is 30."}],
    [{"role": "user", "content": "Create a JSON array of 3 colors."}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo", "gpt-4"],
    messages=messages,
    response_format=[{"type": "json_object"}],  # Enable JSON mode
    temperature=[0.0]
)
experiment.run()

# Validate JSON format
experiment.evaluate(
    "is_valid_json",
    validate_json_response,
    static_eval_fn_kwargs={"response_column_name": "response"}
)

# View results - is_valid_json will be 1.0 for valid JSON, 0.0 otherwise
df = experiment.get_table()
print(df[["model", "response", "is_valid_json"]])
```
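Conceptually, the score reduces to "does the response parse as JSON?". A plain-Python illustration of that idea (not the library's implementation):

```python
import json

def json_validity_score(text: str) -> float:
    """Return 1.0 if text parses as JSON, else 0.0 (illustrative only)."""
    try:
        json.loads(text)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

print(json_validity_score('{"name": "John", "age": 30}'))  # 1.0
print(json_validity_score("three colors"))                 # 0.0
```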
## autoeval_binary_scoring

Use GPT-4 as an automated judge to evaluate whether model responses follow the given directions, returning binary scores.

```python
import os

from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import autoeval_binary_scoring

os.environ["OPENAI_API_KEY"] = "your-api-key"

# Test instruction following
messages = [
    [{"role": "user", "content": "List exactly 3 fruits, one per line."}],
    [{"role": "user", "content": "Write a haiku about winter."}],
    [{"role": "user", "content": "Respond with only 'yes' or 'no': Is water wet?"}],
]

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=messages,
    temperature=[0.0, 0.7]
)
experiment.run()

# Auto-evaluate with GPT-4 as judge
experiment.evaluate(
    "follows_instructions",
    autoeval_binary_scoring,
    static_eval_fn_kwargs={
        "prompt_column_name": "messages",
        "response_column_name": "response"
    }
)

# View results with instruction-following scores
experiment.visualize()

# Aggregate by temperature to see whether lower temperatures follow instructions better
df = experiment.get_table()
print(df.groupby("temperature")["follows_instructions"].mean())
```

## Experiment Export Methods

Export experiment results in various formats for persistence, analysis, or fine-tuning data preparation.

```python
from prompttools.experiment import OpenAIChatExperiment

experiment = OpenAIChatExperiment(
    model=["gpt-3.5-turbo"],
    messages=[
        [{"role": "user", "content": "Translate 'hello' to French."}],
        [{"role": "user", "content": "Translate 'goodbye' to Spanish."}],
    ],
    temperature=[0.0]
)
experiment.run()

# Export to CSV
experiment.to_csv("results.csv", index=False)

# Export to JSON
json_str = experiment.to_json()     # Returns a JSON string
experiment.to_json("results.json")  # Saves to file

# Export to pandas DataFrame
df = experiment.to_pandas_df(get_all_cols=True)

# Export to LoRA format for fine-tuning
experiment.to_lora_json(
    instruction_extract=lambda row: "Translate the following text",
    input_extract=lambda row: str(row["messages"][-1]["content"]),
    output_extract="response",
    path="finetune_data.json"
)

# Export to MongoDB
experiment.to_mongo_db(
    mongo_uri="mongodb://localhost:27017/",
    database_name="experiments",
    collection_name="translation_tests"
)

# Export to Markdown
markdown_table = experiment.to_markdown()
print(markdown_table)
```
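The LoRA export targets instruction-tuning pipelines. Based on the three extractor arguments above, each result row should map to a record shaped roughly as follows; this shape is an assumption for illustration, not an official schema:

```python
# Illustrative record produced for the first row of the experiment above;
# the "output" value is a hypothetical model response.
record = {
    "instruction": "Translate the following text",
    "input": "Translate 'hello' to French.",
    "output": "Bonjour",
}
```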
## Azure OpenAI Service Integration

Run experiments against Azure OpenAI Service deployments with custom endpoint configuration.

```python
import os

from prompttools.experiment import OpenAIChatExperiment

# Set Azure credentials
os.environ["AZURE_OPENAI_KEY"] = "your-azure-key"

# Azure configuration
azure_config = {
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com/",
    "API_VERSION": "2023-12-01-preview"
}

# Use deployment names instead of model names
experiment = OpenAIChatExperiment(
    model=["gpt-35-turbo-deployment", "gpt-4-deployment"],  # Deployment names
    messages=[
        [{"role": "user", "content": "Summarize the benefits of cloud computing."}],
    ],
    temperature=[0.0, 0.5],
    azure_openai_service_configs=azure_config
)
experiment.run()
experiment.visualize()
```

## Summary

PromptTools excels at systematic prompt engineering by enabling developers to test multiple models, prompts, and parameters simultaneously. The core workflow involves creating experiments with parameter lists, running them to generate all combinations, evaluating results with built-in or custom metrics, and exporting data for analysis or fine-tuning. Key use cases include A/B testing prompt templates, comparing model performance, validating structured outputs, regression testing prompt changes, and preparing fine-tuning datasets from successful responses.

The library integrates with major LLM providers (OpenAI, Anthropic, Google, Mistral), local models (LLaMA via llama.cpp), and vector databases (ChromaDB, Pinecone, Weaviate, Qdrant). Harnesses provide higher-level abstractions for common patterns like template testing and model comparison, while the `@prompttest` decorator enables CI/CD integration for continuous prompt quality assurance. Results are stored in pandas DataFrames, making it easy to leverage Python's data science ecosystem for deeper analysis and visualization.