# Promptfoo

https://github.com/promptfoo/promptfoo
Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. It enables developers to systematically test prompts across multiple models, automate quality checks with assertions, and scan for security vulnerabilities before deployment. The tool supports 60+ LLM providers, including OpenAI, Anthropic, Google, Azure, AWS Bedrock, and local models such as Ollama.

The core functionality centers on running evaluations that compare prompt variations, model outputs, and automated assertions to catch regressions and ensure quality. Promptfoo also provides comprehensive red-teaming capabilities that automatically generate adversarial inputs to test for jailbreaks, prompt injections, data leakage, and other security vulnerabilities. Results can be viewed in a web UI, exported to various formats, or integrated into CI/CD pipelines.

## CLI: Initialize a Project

Initialize a new promptfoo project with example configuration, prompts, and test cases.

```bash
# Initialize with a getting-started example
npx promptfoo@latest init --example getting-started

# Initialize interactively
npx promptfoo@latest init

# Initialize in a specific directory
npx promptfoo@latest init my-project
```

## CLI: Run Evaluations

Execute prompt evaluations against configured providers and test cases, with support for caching, concurrency control, and various output formats.

```bash
# Run evaluation with the default config (promptfooconfig.yaml)
npx promptfoo@latest eval

# Run with a specific config and output file
npx promptfoo@latest eval -c my-config.yaml -o results.json

# Run with concurrency and filtering options
npx promptfoo@latest eval --max-concurrency 5 --filter-pattern "auth.*" --no-cache

# Run with provider override
npx promptfoo@latest eval -r openai:gpt-4 anthropic:claude-3-opus

# Watch mode for development
npx promptfoo@latest eval --watch
```

## CLI: View Results

Launch the web-based UI to visualize and analyze evaluation results.

```bash
# Open the web viewer for the latest results
npx promptfoo@latest view

# View on a specific port
npx promptfoo@latest view -p 3000

# Auto-open the browser without confirmation
npx promptfoo@latest view -y
```

## CLI: Red Team Security Scanning

Generate adversarial test cases and scan LLM applications for security vulnerabilities, including jailbreaks, prompt injections, and data leakage.

```bash
# Interactive red team setup (opens the web UI)
npx promptfoo@latest redteam setup

# Run a complete red team scan (generate + evaluate)
npx promptfoo@latest redteam run

# Generate adversarial test cases only
npx promptfoo@latest redteam generate --plugins harmful,pii --strategies jailbreak

# View the red team security report
npx promptfoo@latest redteam report

# List available plugins
npx promptfoo@latest redteam plugins
```

## YAML Configuration: Basic Evaluation Setup

Define prompts, providers, test cases, and assertions in a YAML configuration file for reproducible evaluations. A sketch of the external prompt file referenced here follows the block.

```yaml
# promptfooconfig.yaml
description: Translation quality evaluation

prompts:
  - 'Translate the following to {{language}}: {{text}}'
  - file://prompts/translate-formal.txt

providers:
  - openai:gpt-4
  - anthropic:messages:claude-3-opus
  - ollama:llama3

defaultTest:
  assert:
    - type: llm-rubric
      value: Translation should be natural and fluent

tests:
  - vars:
      language: French
      text: Hello world
    assert:
      - type: contains
        value: Bonjour
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 3000
  - vars:
      language: Spanish
      text: Where is the library?
    assert:
      - type: icontains
        value: biblioteca

evaluateOptions:
  maxConcurrency: 4
  repeat: 1
  cache: true
```
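The config above loads its second prompt from `file://prompts/translate-formal.txt`, which is not included in these snippets. A plausible sketch of that file (hypothetical contents; promptfoo fills `{{language}}` and `{{text}}` from each test's vars):

```text
Provide a formal, professionally worded translation of the following text into {{language}}:

{{text}}
```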
## YAML Configuration: Red Team Security Scan

Configure red team scans with specific plugins for vulnerability testing and strategies for jailbreak attempts.

```yaml
# promptfooconfig.yaml
description: Security scan for travel chatbot

targets:
  - id: https
    label: travel-agent-api
    config:
      url: https://api.example.com/chat
      method: POST
      headers:
        Content-Type: application/json
        Authorization: Bearer {{env.API_KEY}}
      body:
        message: '{{prompt}}'

redteam:
  purpose: >
    Travel planning assistant that helps users book flights and hotels.
    Users should not access other users' data or internal systems.
  plugins:
    - harmful
    - pii
    - hijacking
    - prompt-injection
    - jailbreak
    - competitors
    - sql-injection
    - shell-injection
  strategies:
    - jailbreak
    - prompt-injection
    - multilingual
    - base64
  numTests: 10
  language: English
```

## YAML Configuration: Provider Options

Configure LLM providers with custom settings, including API keys, model parameters, and pricing overrides.

```yaml
# promptfooconfig.yaml
providers:
  # Simple provider reference
  - openai:gpt-4

  # Provider with configuration
  - id: openai:gpt-4-turbo
    config:
      temperature: 0.7
      max_tokens: 2000
      response_format:
        type: json_object

  # Azure OpenAI with a custom endpoint
  - id: azureopenai:gpt-4-deployment
    config:
      apiHost: https://myinstance.openai.azure.com
      apiKey: ${AZURE_OPENAI_KEY}

  # Custom HTTP endpoint
  - id: https://api.mycompany.com/llm
    config:
      headers:
        Authorization: Bearer ${API_KEY}
      body:
        model: custom-model
        prompt: '{{prompt}}'
      responseParser: json.choices[0].message.content

  # Local Ollama model
  - ollama:chat:llama3:8b

  # Custom Python provider
  - file://providers/my_provider.py
```

## YAML Configuration: Assertions and Metrics

Define automated checks on LLM outputs using deterministic, model-graded, and custom assertions. The file-based Python assertion at the end of the block is sketched just after this section.

```yaml
# promptfooconfig.yaml
tests:
  - vars:
      question: What is the capital of France?
    assert:
      # Deterministic assertions
      - type: contains
        value: Paris
      - type: not-contains
        value: London
      - type: is-json
      - type: regex
        value: '^[A-Z].*\.$'

      # Cost and latency constraints
      - type: cost
        threshold: 0.005
      - type: latency
        threshold: 2000

      # Model-graded assertions
      - type: llm-rubric
        value: Answer should be factually correct and concise
        provider: openai:gpt-4

      # Factuality check
      - type: factuality
        value: The capital of France is Paris

      # Similarity check
      - type: similar
        value: Paris is the capital city of France
        threshold: 0.8

      # Custom JavaScript assertion
      - type: javascript
        value: |
          output.length > 10 && output.includes('Paris')

      # Custom Python assertion
      - type: python
        value: file://assertions/check_answer.py
```
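The final assertion above references `file://assertions/check_answer.py`, which is not included in these snippets. A minimal sketch of what such a file could contain (hypothetical; it uses the same `get_assert` interface shown in the Python Custom Assertion section later in this document):

```python
# assertions/check_answer.py (hypothetical example)
def get_assert(output: str, context: dict) -> dict:
    """Pass when the output names Paris and stays reasonably concise."""
    has_answer = 'Paris' in output
    concise = len(output) < 200
    return {
        'pass': has_answer and concise,
        'score': 1.0 if has_answer and concise else 0.0,
        'reason': f'has_answer={has_answer}, concise={concise}',
    }
```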
## Node.js API: evaluate() Function

Programmatically run evaluations using the Node.js library, with full TypeScript support.

```typescript
import promptfoo, { EvaluateTestSuite, EvaluateOptions } from 'promptfoo';

const testSuite: EvaluateTestSuite = {
  prompts: [
    'Summarize this text in {{style}} style: {{text}}',
    (vars) => `Please provide a ${vars.style} summary of: ${vars.text}`,
  ],
  providers: [
    'openai:gpt-4',
    'anthropic:messages:claude-3-sonnet',
  ],
  tests: [
    {
      vars: {
        style: 'formal',
        text: 'The quick brown fox jumps over the lazy dog.',
      },
      assert: [
        { type: 'contains', value: 'fox' },
        {
          type: 'javascript',
          value: (output, context) => ({
            pass: output.length < 200,
            score: output.length < 100 ? 1.0 : 0.5,
            reason: `Output length: ${output.length}`,
          }),
        },
      ],
    },
    {
      vars: {
        style: 'casual',
        text: 'Machine learning is transforming industries worldwide.',
      },
      assert: [
        { type: 'llm-rubric', value: 'Summary captures the main point' },
      ],
    },
  ],
  writeLatestResults: true,
  sharing: true,
};

const options: EvaluateOptions = {
  maxConcurrency: 4,
  showProgressBar: true,
  cache: true,
};

const results = await promptfoo.evaluate(testSuite, options);

console.log(`Pass rate: ${results.stats.successes}/${results.stats.successes + results.stats.failures}`);
console.log(`Total tokens: ${results.stats.tokenUsage.total}`);
if (results.shareableUrl) {
  console.log(`View results: ${results.shareableUrl}`);
}
```

## Node.js API: loadApiProvider() Function

Load and use LLM providers programmatically for custom integrations.

```typescript
import { loadApiProvider } from 'promptfoo';

// Load the OpenAI provider
const openaiProvider = await loadApiProvider('openai:gpt-4');

// Load with custom configuration
const azureProvider = await loadApiProvider('azureopenai:gpt-4-deployment', {
  options: {
    apiHost: 'https://myinstance.openai.azure.com',
    apiKey: process.env.AZURE_KEY,
  },
});

// Call the provider directly
const response = await openaiProvider.callApi('What is 2 + 2?');
if (response.error) {
  console.error('Error:', response.error);
} else {
  console.log('Output:', response.output);
  console.log('Tokens used:', response.tokenUsage);
  console.log('Cost:', response.cost);
}
```

## Node.js API: Custom Provider Function

Implement custom LLM providers using JavaScript/TypeScript functions.

```typescript
import promptfoo from 'promptfoo';

// Custom provider function
const myCustomProvider = async (
  prompt: string,
  context: { vars: Record<string, any> },
) => {
  const response = await fetch('https://my-llm-api.com/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt,
      user_context: context.vars,
    }),
  });

  const data = await response.json();

  return {
    output: data.text,
    tokenUsage: {
      total: data.usage.total_tokens,
      prompt: data.usage.prompt_tokens,
      completion: data.usage.completion_tokens,
    },
    cost: data.usage.total_tokens * 0.00001,
  };
};

const results = await promptfoo.evaluate({
  prompts: ['Tell me about {{topic}}'],
  providers: [myCustomProvider, 'openai:gpt-4'],
  tests: [
    { vars: { topic: 'artificial intelligence' } },
    { vars: { topic: 'climate change' } },
  ],
});
```

## Node.js API: Red Team Generation

Programmatically generate and run red team security tests.

```typescript
import promptfoo from 'promptfoo';

// Generate red team test cases
const redteamConfig = await promptfoo.redteam.generate({
  purpose: 'Customer support chatbot for an e-commerce platform',
  plugins: ['harmful', 'pii', 'hijacking', 'prompt-injection'],
  strategies: ['jailbreak', 'base64'],
  numTests: 5,
  language: 'English',
});

// Run the red team evaluation
const results = await promptfoo.redteam.run({
  providers: [{
    id: 'https://api.example.com/chat',
    config: {
      method: 'POST',
      body: { message: '{{prompt}}' },
    },
    label: 'customer-support-bot',
  }],
  redteam: redteamConfig,
  writeLatestResults: true,
});

// Analyze the results
const vulnerabilities = results.results.filter(r => !r.success);
console.log(`Found ${vulnerabilities.length} potential vulnerabilities`);
for (const vuln of vulnerabilities) {
  console.log(`- ${vuln.gradingResult?.reason}`);
}
```

## Python Custom Provider

Create custom Python-based providers for specialized LLM integrations.
```python
# providers/my_provider.py
import requests


def call_api(prompt: str, options: dict, context: dict) -> dict:
    """
    Custom Python provider for promptfoo.

    Args:
        prompt: The prompt string to send to the LLM
        options: Provider configuration from YAML
        context: Contains vars and other test context

    Returns:
        dict with 'output' and optionally 'tokenUsage', 'cost', 'error'
    """
    api_key = options.get('config', {}).get('apiKey')
    model = options.get('config', {}).get('model', 'default-model')

    try:
        response = requests.post(
            'https://my-llm-service.com/v1/completions',
            headers={
                'Authorization': f'Bearer {api_key}',
                'Content-Type': 'application/json',
            },
            json={
                'model': model,
                'prompt': prompt,
                'max_tokens': 1000,
            },
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()

        return {
            'output': data['choices'][0]['text'],
            'tokenUsage': {
                'total': data['usage']['total_tokens'],
                'prompt': data['usage']['prompt_tokens'],
                'completion': data['usage']['completion_tokens'],
            },
            'cost': data['usage']['total_tokens'] * 0.00002,
        }
    except Exception as e:
        return {'error': str(e)}


def get_transform(output: str, context: dict) -> str:
    """Optional: Transform output before assertions."""
    return output.strip().lower()
```

```yaml
# promptfooconfig.yaml - using the Python provider
providers:
  - id: file://providers/my_provider.py
    config:
      apiKey: ${MY_API_KEY}
      model: custom-llm-v2
```

## Python Custom Assertion

Create custom Python-based assertions for complex validation logic.

```python
# assertions/validate_json_response.py
import json
from typing import Union


def get_assert(output: str, context: dict) -> Union[bool, float, dict]:
    """
    Custom assertion to validate JSON responses.

    Args:
        output: The LLM output string
        context: Contains 'vars', 'prompt', 'test', 'provider'

    Returns:
        bool: Simple pass/fail
        float: Score between 0 and 1
        dict: Full GradingResult with pass, score, reason
    """
    expected_keys = (
        context.get('test', {})
        .get('assert', [{}])[0]
        .get('config', {})
        .get('expected_keys', [])
    )

    try:
        data = json.loads(output)
        missing_keys = [key for key in expected_keys if key not in data]

        if missing_keys:
            return {
                'pass': False,
                'score': len(set(expected_keys) - set(missing_keys)) / len(expected_keys),
                'reason': f'Missing required keys: {missing_keys}',
            }

        return {
            'pass': True,
            'score': 1.0,
            'reason': 'All required keys present in JSON response',
        }
    except json.JSONDecodeError as e:
        return {
            'pass': False,
            'score': 0.0,
            'reason': f'Invalid JSON: {str(e)}',
        }
```

```yaml
# promptfooconfig.yaml - using the Python assertion
tests:
  - vars:
      query: Get user profile
    assert:
      - type: python
        value: file://assertions/validate_json_response.py
        config:
          expected_keys: ['id', 'name', 'email']
```

## HTTP Provider Configuration

Configure HTTP-based providers to test custom API endpoints with full request/response control.

```yaml
# promptfooconfig.yaml
providers:
  - id: https
    label: my-chat-api
    config:
      url: https://api.example.com/v1/chat/completions
      method: POST
      headers:
        Authorization: Bearer ${API_KEY}
        Content-Type: application/json
        X-Request-ID: '{{uuid}}'
      body:
        model: gpt-4
        messages:
          - role: system
            content: You are a helpful assistant.
          - role: user
            content: '{{prompt}}'
        temperature: 0.7
        max_tokens: 1000
      # Parse the response to extract the output
      responseParser: json.choices[0].message.content
      # Extract a session ID for multi-turn conversations
      sessionParser: headers['x-session-id']
      # Transform the response before assertions
      transformResponse: |
        json.choices[0].message.content.trim()
```

## Extension Hooks

Implement lifecycle hooks for custom setup, teardown, and test modification.

```javascript
// extensions/hooks.js
module.exports = async function extensionHook(hookName, context) {
  if (hookName === 'beforeAll') {
    console.log('Starting test suite:', context.suite.description);
    // Add dynamic test cases
    context.suite.tests.push({
      vars: { topic: 'generated-topic' },
      assert: [{ type: 'contains', value: 'expected' }],
    });
    return context;
  }

  if (hookName === 'beforeEach') {
    // Create a session for each test
    const response = await fetch('https://api.example.com/session', {
      method: 'POST',
    });
    const { sessionId } = await response.json();
    context.test.vars.sessionId = sessionId;
    return context;
  }

  if (hookName === 'afterEach') {
    // Clean up the session
    const sessionId = context.test.vars.sessionId;
    if (sessionId) {
      await fetch(`https://api.example.com/session/${sessionId}`, {
        method: 'DELETE',
      });
    }
    // Log results
    console.log(`Test: ${context.test.description}`);
    console.log(`Pass: ${context.result.success}`);
    console.log(`Score: ${context.result.score}`);
  }

  if (hookName === 'afterAll') {
    console.log(`Completed: ${context.results.length} tests`);
    console.log(`Pass rate: ${context.results.filter(r => r.success).length}/${context.results.length}`);
  }
};
```

```yaml
# promptfooconfig.yaml
extensions:
  - file://extensions/hooks.js:extensionHook
```

## CI/CD Integration with GitHub Actions

Integrate promptfoo evaluations into GitHub Actions for automated testing.

```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install promptfoo
        run: npm install -g promptfoo

      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptfoo eval --no-progress-bar -o results.json

      - name: Check pass rate
        run: |
          PASS_RATE=$(jq '.stats.successes / (.stats.successes + .stats.failures) * 100' results.json)
          echo "Pass rate: ${PASS_RATE}%"
          if (( $(echo "$PASS_RATE < 90" | bc -l) )); then
            echo "Pass rate below threshold!"
            exit 1
          fi

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json

  redteam:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Run red team scan
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo@latest redteam run --no-progress-bar
          npx promptfoo@latest redteam report --output security-report.html

      - name: Upload security report
        uses: actions/upload-artifact@v4
        with:
          name: security-report
          path: security-report.html
```

## MCP Server Integration

Start a Model Context Protocol (MCP) server to expose promptfoo capabilities to AI agents and development tools.
```bash
# Start the MCP server with STDIO transport (for Cursor, Claude Desktop)
npx promptfoo@latest mcp --transport stdio

# Start the MCP server with HTTP transport on a custom port
npx promptfoo@latest mcp --transport http --port 8080
```

Register the server in `.cursor/mcp.json` or `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "promptfoo": {
      "command": "npx",
      "args": ["promptfoo@latest", "mcp", "--transport", "stdio"]
    }
  }
}
```

## Summary

Promptfoo serves as a comprehensive testing and security framework for LLM applications, bridging the gap between development and production deployment. The primary use cases include prompt engineering with systematic A/B testing across models, quality assurance through automated assertions and regression testing, security validation via red team adversarial testing, and continuous integration through CI/CD pipeline integration. Teams use promptfoo to compare model outputs side by side, validate JSON schema compliance (see the schema-validation sketch below), check for hallucinations with factuality assertions, and ensure responses meet cost and latency requirements.

Integration patterns typically start with YAML configuration files for simple evaluations and evolve to programmatic Node.js usage for complex workflows. Custom providers enable testing proprietary APIs and local models, while extension hooks support stateful testing scenarios such as multi-turn conversations. For security-conscious deployments, the red team functionality automatically generates adversarial inputs targeting 50+ vulnerability categories, including jailbreaks, prompt injections, PII leakage, and unauthorized data access. Results flow into web-based dashboards for manual review or JSON/CSV exports for automated analysis, making promptfoo suitable for both interactive development and headless CI/CD execution.
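The JSON schema validation mentioned in the summary is not illustrated in the snippets above. Promptfoo's `is-json` assertion accepts an optional JSON Schema as its value; a minimal sketch (the coordinate fields are illustrative, not taken from the promptfoo docs):

```yaml
# promptfooconfig.yaml - JSON Schema validation sketch (illustrative fields)
tests:
  - vars:
      question: Return the coordinates of Paris as JSON
    assert:
      # Without a value, is-json only checks that the output parses as JSON;
      # with a value, the output is also validated against the given schema.
      - type: is-json
        value:
          type: object
          required: [latitude, longitude]
          properties:
            latitude:
              type: number
            longitude:
              type: number
```

The CSV export mentioned alongside it follows the same `-o` convention as the JSON example in the CLI section (for example, `npx promptfoo@latest eval -o results.csv`), with the output format inferred from the file extension.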