# Promptfoo

https://github.com/promptfoo/promptfoo
Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. It enables developers to systematically test prompts across multiple models, automate quality checks with assertions, and scan for security vulnerabilities before deployment. The tool supports 60+ LLM providers, including OpenAI, Anthropic, Google, Azure, AWS Bedrock, and local models such as Ollama.

The core functionality centers on running evaluations that compare prompt variations, model outputs, and automated assertions to catch regressions and ensure quality. Promptfoo also provides comprehensive red-teaming capabilities that automatically generate adversarial inputs to test for jailbreaks, prompt injections, data leakage, and other security vulnerabilities. Results can be viewed in a web UI, exported to various formats, or integrated into CI/CD pipelines.

## CLI: Initialize a Project

Initialize a new promptfoo project with example configuration, prompts, and test cases.

```bash
# Initialize with a getting-started example
npx promptfoo@latest init --example getting-started

# Initialize interactively
npx promptfoo@latest init

# Initialize in a specific directory
npx promptfoo@latest init my-project
```

## CLI: Run Evaluations

Execute prompt evaluations against configured providers and test cases, with support for caching, concurrency control, and various output formats.

```bash
# Run evaluation with the default config (promptfooconfig.yaml)
npx promptfoo@latest eval

# Run with a specific config and output file
npx promptfoo@latest eval -c my-config.yaml -o results.json

# Run with concurrency and filtering options
npx promptfoo@latest eval --max-concurrency 5 --filter-pattern "auth.*" --no-cache

# Run with provider override
npx promptfoo@latest eval -r openai:gpt-4 anthropic:claude-3-opus

# Watch mode for development
npx promptfoo@latest eval --watch
```

## CLI: View Results

Launch the web-based UI to visualize and analyze evaluation results.

```bash
# Open the web viewer for the latest results
npx promptfoo@latest view

# View on a specific port
npx promptfoo@latest view -p 3000

# Auto-open the browser without confirmation
npx promptfoo@latest view -y
```

## CLI: Red Team Security Scanning

Generate adversarial test cases and scan LLM applications for security vulnerabilities, including jailbreaks, prompt injections, and data leakage.

```bash
# Interactive red team setup (opens the web UI)
npx promptfoo@latest redteam setup

# Run a complete red team scan (generate + evaluate)
npx promptfoo@latest redteam run

# Generate adversarial test cases only
npx promptfoo@latest redteam generate --plugins harmful,pii --strategies jailbreak

# View the red team security report
npx promptfoo@latest redteam report

# List available plugins
npx promptfoo@latest redteam plugins
```

## YAML Configuration: Basic Evaluation Setup

Define prompts, providers, test cases, and assertions in a YAML configuration file for reproducible evaluations. A sketch of the external prompt file referenced here follows the block.

```yaml
# promptfooconfig.yaml
description: Translation quality evaluation

prompts:
  - 'Translate the following to {{language}}: {{text}}'
  - file://prompts/translate-formal.txt

providers:
  - openai:gpt-4
  - anthropic:messages:claude-3-opus
  - ollama:llama3

defaultTest:
  assert:
    - type: llm-rubric
      value: Translation should be natural and fluent

tests:
  - vars:
      language: French
      text: Hello world
    assert:
      - type: contains
        value: Bonjour
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 3000
  - vars:
      language: Spanish
      text: Where is the library?
    assert:
      - type: icontains
        value: biblioteca

evaluateOptions:
  maxConcurrency: 4
  repeat: 1
  cache: true
```
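The config above loads its second prompt from `file://prompts/translate-formal.txt`, which is not included in these snippets. A plausible sketch of that file (hypothetical contents; promptfoo fills `{{language}}` and `{{text}}` from each test's vars):

```text
Provide a formal, professionally worded translation of the following text into {{language}}:

{{text}}
```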
## YAML Configuration: Red Team Security Scan

Configure red team scans with specific plugins for vulnerability testing and strategies for jailbreak attempts.

```yaml
# promptfooconfig.yaml
description: Security scan for travel chatbot

targets:
  - id: https
    label: travel-agent-api
    config:
      url: https://api.example.com/chat
      method: POST
      headers:
        Content-Type: application/json
        Authorization: Bearer {{env.API_KEY}}
      body:
        message: '{{prompt}}'

redteam:
  purpose: >
    Travel planning assistant that helps users book flights and hotels.
    Users should not access other users' data or internal systems.
  plugins:
    - harmful
    - pii
    - hijacking
    - prompt-injection
    - jailbreak
    - competitors
    - sql-injection
    - shell-injection
  strategies:
    - jailbreak
    - prompt-injection
    - multilingual
    - base64
  numTests: 10
  language: English
```

## YAML Configuration: Provider Options

Configure LLM providers with custom settings, including API keys, model parameters, and pricing overrides.

```yaml
# promptfooconfig.yaml
providers:
  # Simple provider reference
  - openai:gpt-4

  # Provider with configuration
  - id: openai:gpt-4-turbo
    config:
      temperature: 0.7
      max_tokens: 2000
      response_format:
        type: json_object

  # Azure OpenAI with a custom endpoint
  - id: azureopenai:gpt-4-deployment
    config:
      apiHost: https://myinstance.openai.azure.com
      apiKey: ${AZURE_OPENAI_KEY}

  # Custom HTTP endpoint
  - id: https://api.mycompany.com/llm
    config:
      headers:
        Authorization: Bearer ${API_KEY}
      body:
        model: custom-model
        prompt: '{{prompt}}'
      responseParser: json.choices[0].message.content

  # Local Ollama model
  - ollama:chat:llama3:8b

  # Custom Python provider
  - file://providers/my_provider.py
```

## YAML Configuration: Assertions and Metrics

Define automated checks on LLM outputs using deterministic, model-graded, and custom assertions. The file-based Python assertion at the end of the block is sketched just after this section.

```yaml
# promptfooconfig.yaml
tests:
  - vars:
      question: What is the capital of France?
    assert:
      # Deterministic assertions
      - type: contains
        value: Paris
      - type: not-contains
        value: London
      - type: is-json
      - type: regex
        value: '^[A-Z].*\.$'

      # Cost and latency constraints
      - type: cost
        threshold: 0.005
      - type: latency
        threshold: 2000

      # Model-graded assertions
      - type: llm-rubric
        value: Answer should be factually correct and concise
        provider: openai:gpt-4

      # Factuality check
      - type: factuality
        value: The capital of France is Paris

      # Similarity check
      - type: similar
        value: Paris is the capital city of France
        threshold: 0.8

      # Custom JavaScript assertion
      - type: javascript
        value: |
          output.length > 10 && output.includes('Paris')

      # Custom Python assertion
      - type: python
        value: file://assertions/check_answer.py
```
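The final assertion above references `file://assertions/check_answer.py`, which is not included in these snippets. A minimal sketch of what such a file could contain (hypothetical; it uses the same `get_assert` interface shown in the Python Custom Assertion section later in this document):

```python
# assertions/check_answer.py (hypothetical example)
def get_assert(output: str, context: dict) -> dict:
    """Pass when the output names Paris and stays reasonably concise."""
    has_answer = 'Paris' in output
    concise = len(output) < 200
    return {
        'pass': has_answer and concise,
        'score': 1.0 if has_answer and concise else 0.0,
        'reason': f'has_answer={has_answer}, concise={concise}',
    }
```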
## Node.js API: evaluate() Function

Programmatically run evaluations using the Node.js library, with full TypeScript support.

```typescript
import promptfoo, { EvaluateTestSuite, EvaluateOptions } from 'promptfoo';

const testSuite: EvaluateTestSuite = {
  prompts: [
    'Summarize this text in {{style}} style: {{text}}',
    (vars) => `Please provide a ${vars.style} summary of: ${vars.text}`,
  ],
  providers: [
    'openai:gpt-4',
    'anthropic:messages:claude-3-sonnet',
  ],
  tests: [
    {
      vars: {
        style: 'formal',
        text: 'The quick brown fox jumps over the lazy dog.',
      },
      assert: [
        { type: 'contains', value: 'fox' },
        {
          type: 'javascript',
          value: (output, context) => ({
            pass: output.length < 200,
            score: output.length < 100 ? 1.0 : 0.5,
            reason: `Output length: ${output.length}`,
          }),
        },
      ],
    },
    {
      vars: {
        style: 'casual',
        text: 'Machine learning is transforming industries worldwide.',
      },
      assert: [
        { type: 'llm-rubric', value: 'Summary captures the main point' },
      ],
    },
  ],
  writeLatestResults: true,
  sharing: true,
};

const options: EvaluateOptions = {
  maxConcurrency: 4,
  showProgressBar: true,
  cache: true,
};

const results = await promptfoo.evaluate(testSuite, options);

console.log(`Pass rate: ${results.stats.successes}/${results.stats.successes + results.stats.failures}`);
console.log(`Total tokens: ${results.stats.tokenUsage.total}`);
if (results.shareableUrl) {
  console.log(`View results: ${results.shareableUrl}`);
}
```

## Node.js API: loadApiProvider() Function

Load and use LLM providers programmatically for custom integrations.

```typescript
import { loadApiProvider } from 'promptfoo';

// Load the OpenAI provider
const openaiProvider = await loadApiProvider('openai:gpt-4');

// Load with custom configuration
const azureProvider = await loadApiProvider('azureopenai:gpt-4-deployment', {
  options: {
    apiHost: 'https://myinstance.openai.azure.com',
    apiKey: process.env.AZURE_KEY,
  },
});

// Call the provider directly
const response = await openaiProvider.callApi('What is 2 + 2?');
if (response.error) {
  console.error('Error:', response.error);
} else {
  console.log('Output:', response.output);
  console.log('Tokens used:', response.tokenUsage);
  console.log('Cost:', response.cost);
}
```

## Node.js API: Custom Provider Function

Implement custom LLM providers using JavaScript/TypeScript functions.

```typescript
import promptfoo from 'promptfoo';

// Custom provider function
const myCustomProvider = async (
  prompt: string,
  context: { vars: Record<string, any> },
) => {
  const response = await fetch('https://my-llm-api.com/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt,
      user_context: context.vars,
    }),
  });

  const data = await response.json();

  return {
    output: data.text,
    tokenUsage: {
      total: data.usage.total_tokens,
      prompt: data.usage.prompt_tokens,
      completion: data.usage.completion_tokens,
    },
    cost: data.usage.total_tokens * 0.00001,
  };
};

const results = await promptfoo.evaluate({
  prompts: ['Tell me about {{topic}}'],
  providers: [myCustomProvider, 'openai:gpt-4'],
  tests: [
    { vars: { topic: 'artificial intelligence' } },
    { vars: { topic: 'climate change' } },
  ],
});
```

## Node.js API: Red Team Generation

Programmatically generate and run red team security tests.

```typescript
import promptfoo from 'promptfoo';

// Generate red team test cases
const redteamConfig = await promptfoo.redteam.generate({
  purpose: 'Customer support chatbot for an e-commerce platform',
  plugins: ['harmful', 'pii', 'hijacking', 'prompt-injection'],
  strategies: ['jailbreak', 'base64'],
  numTests: 5,
  language: 'English',
});

// Run the red team evaluation
const results = await promptfoo.redteam.run({
  providers: [{
    id: 'https://api.example.com/chat',
    config: {
      method: 'POST',
      body: { message: '{{prompt}}' },
    },
    label: 'customer-support-bot',
  }],
  redteam: redteamConfig,
  writeLatestResults: true,
});

// Analyze the results
const vulnerabilities = results.results.filter(r => !r.success);
console.log(`Found ${vulnerabilities.length} potential vulnerabilities`);
for (const vuln of vulnerabilities) {
  console.log(`- ${vuln.gradingResult?.reason}`);
}
```

## Python Custom Provider

Create custom Python-based providers for specialized LLM integrations.
```python
# providers/my_provider.py
import requests


def call_api(prompt: str, options: dict, context: dict) -> dict:
    """
    Custom Python provider for promptfoo.

    Args:
        prompt: The prompt string to send to the LLM
        options: Provider configuration from YAML
        context: Contains vars and other test context

    Returns:
        dict with 'output' and optionally 'tokenUsage', 'cost', 'error'
    """
    api_key = options.get('config', {}).get('apiKey')
    model = options.get('config', {}).get('model', 'default-model')

    try:
        response = requests.post(
            'https://my-llm-service.com/v1/completions',
            headers={
                'Authorization': f'Bearer {api_key}',
                'Content-Type': 'application/json',
            },
            json={
                'model': model,
                'prompt': prompt,
                'max_tokens': 1000,
            },
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()

        return {
            'output': data['choices'][0]['text'],
            'tokenUsage': {
                'total': data['usage']['total_tokens'],
                'prompt': data['usage']['prompt_tokens'],
                'completion': data['usage']['completion_tokens'],
            },
            'cost': data['usage']['total_tokens'] * 0.00002,
        }
    except Exception as e:
        return {'error': str(e)}


def get_transform(output: str, context: dict) -> str:
    """Optional: Transform output before assertions."""
    return output.strip().lower()
```

```yaml
# promptfooconfig.yaml - using the Python provider
providers:
  - id: file://providers/my_provider.py
    config:
      apiKey: ${MY_API_KEY}
      model: custom-llm-v2
```

## Python Custom Assertion

Create custom Python-based assertions for complex validation logic.

```python
# assertions/validate_json_response.py
import json
from typing import Union


def get_assert(output: str, context: dict) -> Union[bool, float, dict]:
    """
    Custom assertion to validate JSON responses.

    Args:
        output: The LLM output string
        context: Contains 'vars', 'prompt', 'test', 'provider'

    Returns:
        bool: Simple pass/fail
        float: Score between 0 and 1
        dict: Full GradingResult with pass, score, reason
    """
    expected_keys = (
        context.get('test', {})
        .get('assert', [{}])[0]
        .get('config', {})
        .get('expected_keys', [])
    )

    try:
        data = json.loads(output)
        missing_keys = [key for key in expected_keys if key not in data]

        if missing_keys:
            return {
                'pass': False,
                'score': len(set(expected_keys) - set(missing_keys)) / len(expected_keys),
                'reason': f'Missing required keys: {missing_keys}',
            }

        return {
            'pass': True,
            'score': 1.0,
            'reason': 'All required keys present in JSON response',
        }
    except json.JSONDecodeError as e:
        return {
            'pass': False,
            'score': 0.0,
            'reason': f'Invalid JSON: {str(e)}',
        }
```

```yaml
# promptfooconfig.yaml - using the Python assertion
tests:
  - vars:
      query: Get user profile
    assert:
      - type: python
        value: file://assertions/validate_json_response.py
        config:
          expected_keys: ['id', 'name', 'email']
```

## HTTP Provider Configuration

Configure HTTP-based providers to test custom API endpoints with full request/response control.

```yaml
# promptfooconfig.yaml
providers:
  - id: https
    label: my-chat-api
    config:
      url: https://api.example.com/v1/chat/completions
      method: POST
      headers:
        Authorization: Bearer ${API_KEY}
        Content-Type: application/json
        X-Request-ID: '{{uuid}}'
      body:
        model: gpt-4
        messages:
          - role: system
            content: You are a helpful assistant.
          - role: user
            content: '{{prompt}}'
        temperature: 0.7
        max_tokens: 1000
      # Parse the response to extract the output
      responseParser: json.choices[0].message.content
      # Extract a session ID for multi-turn conversations
      sessionParser: headers['x-session-id']
      # Transform the response before assertions
      transformResponse: |
        json.choices[0].message.content.trim()
```

## Extension Hooks

Implement lifecycle hooks for custom setup, teardown, and test modification.

```javascript
// extensions/hooks.js
module.exports = async function extensionHook(hookName, context) {
  if (hookName === 'beforeAll') {
    console.log('Starting test suite:', context.suite.description);
    // Add dynamic test cases
    context.suite.tests.push({
      vars: { topic: 'generated-topic' },
      assert: [{ type: 'contains', value: 'expected' }],
    });
    return context;
  }

  if (hookName === 'beforeEach') {
    // Create a session for each test
    const response = await fetch('https://api.example.com/session', {
      method: 'POST',
    });
    const { sessionId } = await response.json();
    context.test.vars.sessionId = sessionId;
    return context;
  }

  if (hookName === 'afterEach') {
    // Clean up the session
    const sessionId = context.test.vars.sessionId;
    if (sessionId) {
      await fetch(`https://api.example.com/session/${sessionId}`, {
        method: 'DELETE',
      });
    }
    // Log results
    console.log(`Test: ${context.test.description}`);
    console.log(`Pass: ${context.result.success}`);
    console.log(`Score: ${context.result.score}`);
  }

  if (hookName === 'afterAll') {
    console.log(`Completed: ${context.results.length} tests`);
    console.log(`Pass rate: ${context.results.filter(r => r.success).length}/${context.results.length}`);
  }
};
```

```yaml
# promptfooconfig.yaml
extensions:
  - file://extensions/hooks.js:extensionHook
```

## CI/CD Integration with GitHub Actions

Integrate promptfoo evaluations into GitHub Actions for automated testing.

```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install promptfoo
        run: npm install -g promptfoo

      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptfoo eval --no-progress-bar -o results.json

      - name: Check pass rate
        run: |
          PASS_RATE=$(jq '.stats.successes / (.stats.successes + .stats.failures) * 100' results.json)
          echo "Pass rate: ${PASS_RATE}%"
          if (( $(echo "$PASS_RATE < 90" | bc -l) )); then
            echo "Pass rate below threshold!"
            exit 1
          fi

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json

  redteam:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Run red team scan
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo@latest redteam run --no-progress-bar
          npx promptfoo@latest redteam report --output security-report.html

      - name: Upload security report
        uses: actions/upload-artifact@v4
        with:
          name: security-report
          path: security-report.html
```

## MCP Server Integration

Start a Model Context Protocol (MCP) server to expose promptfoo capabilities to AI agents and development tools.
```bash
# Start the MCP server with STDIO transport (for Cursor, Claude Desktop)
npx promptfoo@latest mcp --transport stdio

# Start the MCP server with HTTP transport on a custom port
npx promptfoo@latest mcp --transport http --port 8080
```

Register the server in `.cursor/mcp.json` or `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "promptfoo": {
      "command": "npx",
      "args": ["promptfoo@latest", "mcp", "--transport", "stdio"]
    }
  }
}
```

## Summary

Promptfoo serves as a comprehensive testing and security framework for LLM applications, bridging the gap between development and production deployment. The primary use cases include prompt engineering with systematic A/B testing across models, quality assurance through automated assertions and regression testing, security validation via red team adversarial testing, and continuous integration through CI/CD pipeline integration. Teams use promptfoo to compare model outputs side by side, validate JSON schema compliance (see the schema-validation sketch below), check for hallucinations with factuality assertions, and ensure responses meet cost and latency requirements.

Integration patterns typically start with YAML configuration files for simple evaluations and evolve to programmatic Node.js usage for complex workflows. Custom providers enable testing proprietary APIs and local models, while extension hooks support stateful testing scenarios such as multi-turn conversations. For security-conscious deployments, the red team functionality automatically generates adversarial inputs targeting 50+ vulnerability categories, including jailbreaks, prompt injections, PII leakage, and unauthorized data access. Results flow into web-based dashboards for manual review or JSON/CSV exports for automated analysis, making promptfoo suitable for both interactive development and headless CI/CD execution.
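The JSON schema validation mentioned in the summary is not illustrated in the snippets above. Promptfoo's `is-json` assertion accepts an optional JSON Schema as its value; a minimal sketch (the coordinate fields are illustrative, not taken from the promptfoo docs):

```yaml
# promptfooconfig.yaml - JSON Schema validation sketch (illustrative fields)
tests:
  - vars:
      question: Return the coordinates of Paris as JSON
    assert:
      # Without a value, is-json only checks that the output parses as JSON;
      # with a value, the output is also validated against the given schema.
      - type: is-json
        value:
          type: object
          required: [latitude, longitude]
          properties:
            latitude:
              type: number
            longitude:
              type: number
```

The CSV export mentioned alongside it follows the same `-o` convention as the JSON example in the CLI section (for example, `npx promptfoo@latest eval -o results.csv`), with the output format inferred from the file extension.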