# EvalView

EvalView is an open-source behavioral regression detection framework for AI agents. It works like Playwright, but for tool-calling and multi-turn AI agents, catching silent regressions when prompts, models, or tools change. The framework tracks drift across outputs, tools, model IDs, and runtime fingerprints, distinguishing provider changes from actual system regressions. EvalView provides golden baseline diffing, auto-healing capabilities, and seamless CI/CD integration.

EvalView supports multiple agent frameworks, including LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API. Each run is scored across six dimensions: hard-fail safety gates, tool accuracy, output quality (LLM-as-judge), sequence correctness, cost thresholds, and latency thresholds. Tests are defined in YAML and can be run via the CLI or the Python API.

## Installation

Install EvalView with pip to enable AI agent regression testing.

```bash
# Basic installation
pip install evalview

# With interactive HTML report charts
pip install evalview[reports]

# With watch mode for live development
pip install evalview[watch]

# All optional features
pip install evalview[all]

# Verify installation
evalview --version
```

## Initialize Project

Auto-detect your running agent and create a starter test suite with configuration.

```bash
# Initialize EvalView in current directory
evalview init

# Output:
# Created .evalview/config.yaml
# Created tests/test-cases/example.yaml
# Detected agent at http://localhost:8000

# Override auto-detected agent profile
evalview init --profile rag

# Available profiles: chat, tool-use, multi-step, rag, coding
```

## Configuration File

Configure the adapter, endpoint, and scoring weights in `.evalview/config.yaml`.

```yaml
# .evalview/config.yaml
adapter: http
endpoint: http://localhost:8000/api/chat
timeout: 30.0

# Custom HTTP headers
headers:
  Authorization: Bearer ${API_KEY}
  Content-Type: application/json

# Scoring weights (must sum to 1.0)
scoring:
  weights:
    tool_accuracy: 0.3
    output_quality: 0.5
    sequence_correctness: 0.2

# Retry configuration for flaky tests
retry:
  max_retries: 2
  base_delay: 1.0
```

## Write Test Cases

Define test cases in YAML with expected tools, output assertions, and thresholds.

```yaml
# tests/test-cases/weather-query.yaml
name: "weather-query-test"
description: "Test that the agent fetches weather information correctly"

input:
  query: "What's the weather in San Francisco?"
  context:
    user_id: "test-user-123"

expected:
  tools:
    - get_weather
    - format_response
  tool_sequence:
    - get_weather
    - format_response
  forbidden_tools:
    - delete_user
    - admin_override
  output:
    contains:
      - "San Francisco"
      - "temperature"
    not_contains:
      - "error"
      - "I don't know"
    regex_patterns:
      - "\\d+°[CF]"
    json_schema:
      type: object
      required:
        - location
        - temperature

thresholds:
  min_score: 80
  max_cost: 0.05
  max_latency: 5000
```

## Multi-Turn Conversation Tests

Test multi-step conversations with per-turn assertions and context tracking.

```yaml
# tests/test-cases/booking-flow.yaml
name: "flight-booking-flow"
description: "Test multi-turn flight booking conversation"

turns:
  - query: "I want to book a flight to Paris"
    expected:
      tools: [search_flights]
      output:
        contains: ["Paris", "available flights"]

  - query: "Book the cheapest option"
    expected:
      tools: [book_flight, confirm_booking]
      forbidden_tools: [delete_booking, refund]
      output:
        contains: ["confirmation", "booked"]
        not_contains: ["error", "failed"]

  - query: "What's my booking status?"
    expected:
      tools: [get_booking_status]
      output:
        contains: ["confirmed", "Paris"]

thresholds:
  min_score: 75
  max_cost: 0.25
  max_latency: 15000
```

## Run Tests

Execute test cases with various options for parallel execution, retries, and output formats.

```bash
# Run all tests (parallel by default)
evalview run

# Run specific tests by name
evalview run -t "weather-query" -t "booking-flow"

# Run with pattern matching
evalview run --pattern "tests/api/*.yaml"

# Sequential execution with retries
evalview run --sequential --max-retries 3

# Generate HTML report
evalview run --html-report report.html

# Verbose output with coverage info
evalview run --verbose --coverage

# Dry run to preview plan and estimate cost
evalview run --dry-run

# Set budget limit
evalview run --budget 1.00

# Skip LLM judge for free deterministic-only scoring
evalview run --no-judge
```

## Golden Baseline Snapshots

Save current behavior as a baseline and detect regressions in future runs.

```bash
# Save all passing tests as golden baseline
evalview snapshot

# Snapshot specific test
evalview snapshot --test "weather-query"

# Save alternate valid behavior as variant
evalview snapshot --variant v2 --notes "Post-refactor baseline"

# Preview what would change without saving
evalview snapshot --preview

# List all saved baselines
evalview snapshot list

# Show details of a specific baseline
evalview snapshot show "weather-query"

# Delete a baseline
evalview snapshot delete "weather-query" --force
```

## Check for Regressions

Compare current behavior against saved baselines to detect regressions.

```bash
# Check all tests against baselines
evalview check

# Check specific test
evalview check --test "weather-query"

# JSON output for CI integration
evalview check --json

# Configure which statuses fail the check
evalview check --fail-on REGRESSION
evalview check --fail-on REGRESSION,TOOLS_CHANGED

# Strictest mode - fail on any change
evalview check --strict

# Enable semantic similarity comparison
evalview check --semantic-diff

# Auto-heal flaky failures with retries
evalview check --heal

# Set budget limit
evalview check --budget 0.50
```

## Watch Mode

Automatically re-run regression checks on file changes during development.

```bash
# Watch current directory
evalview watch

# Quick mode - no LLM judge, $0, sub-second
evalview watch --quick

# Watch specific test only
evalview watch --test "refund-flow"

# Output example:
# ╭─────────────────── EvalView Watch ───────────────────╮
# │ Watching   .                                          │
# │ Tests      all in tests/                              │
# │ Mode       quick (no judge, $0)                       │
# ╰───────────────────────────────────────────────────────╯
#
# 14:32:07  Change detected: src/agent.py
# ████████████████████░░░░  4 passed · 1 tools changed
```

## Production Monitoring

Continuously monitor your agent in production with Slack alerts.

```bash
# Monitor every 5 minutes (default)
evalview monitor

# Live terminal dashboard
evalview monitor --dashboard

# With Slack webhook for alerts
evalview monitor --slack-webhook https://hooks.slack.com/services/...

# Export history for dashboards
evalview monitor --history monitor.jsonl

# Custom interval (in seconds)
evalview monitor --interval 300
```

## Python API - gate()

Use EvalView programmatically, without the CLI, to integrate regression checks into agent workflows.
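For example, a pre-merge or deploy script can turn the gate result into a process exit code. The sketch below is illustrative: the `sys.exit` wiring is ours, while `gate()` and `result.passed` are the pieces of the API shown in the full example that follows.

```python
import sys

from evalview import gate

# Run every test in tests/ against its golden baseline and fail the
# calling script (exit code 1) if any comparison regresses.
result = gate(test_dir="tests/")
sys.exit(0 if result.passed else 1)
```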
```python
from evalview import gate, gate_async, DiffStatus

# Synchronous regression check
result = gate(test_dir="tests/")

# Check if all tests passed
if result.passed:
    print("All tests passed!")
else:
    print(f"Status: {result.status.value}")
    print(f"Regressions: {result.summary.regressions}")
    print(f"Tools changed: {result.summary.tools_changed}")

# Iterate through individual test results
for diff in result.diffs:
    print(f"{diff.test_name}: {diff.status.value}")
    print(f"  Score delta: {diff.score_delta}")
    print(f"  Tool changes: {diff.tool_changes}")
    if diff.output_similarity:
        print(f"  Output similarity: {diff.output_similarity:.1%}")

# Quick mode - no LLM judge, $0, sub-second
result = gate(test_dir="tests/", quick=True)

# Custom fail conditions
result = gate(
    test_dir="tests/",
    fail_on={DiffStatus.REGRESSION, DiffStatus.TOOLS_CHANGED},
    judge_model="gpt-5",
    semantic_diff=True,
    timeout=60.0
)

# Async variant for agent frameworks
async def check_regressions():
    result = await gate_async(test_dir="tests/", quick=True)
    return result.passed
```

## Python API - gate_or_revert()

Auto-revert code changes when a regression is detected in autonomous loops.

```python
from evalview.openclaw import gate_or_revert

# Make a code change
make_code_change()

# Check and auto-revert if regression detected
if not gate_or_revert("tests/", quick=True):
    # Change was reverted - try a different approach
    print("Regression detected, change reverted")
    try_alternative_approach()
else:
    print("Change passed regression check")
    commit_changes()
```

## LLM-as-Judge Configuration

Configure which LLM provider and model to use for output quality evaluation.

```bash
# Use OpenAI GPT
evalview run --judge-model gpt-5 --judge-provider openai

# Use Anthropic Claude
evalview run --judge-model sonnet --judge-provider anthropic

# Use HuggingFace (free)
evalview run --judge-model llama-70b --judge-provider huggingface

# Use local Ollama (free)
evalview run --judge-model llama3.2 --judge-provider ollama

# Use DeepSeek
evalview run --judge-model deepseek-chat --judge-provider deepseek

# Skip LLM judge entirely (deterministic scoring only)
evalview run --no-judge
```

```yaml
# Model shortcuts in config
# gpt-5     -> gpt-5
# sonnet    -> claude-sonnet-4-5-20250929
# opus      -> claude-opus-4-5-20251101
# llama-70b -> meta-llama/Llama-3.1-70B-Instruct
# gemini    -> gemini-3.0
```

## GitHub Actions Integration

Add regression checks to your CI/CD pipeline with automatic PR comments and artifacts.

```yaml
# .github/workflows/evalview.yml
name: EvalView Agent Check

on: [pull_request, push]

jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Check for agent regressions
        id: evalview
        uses: hidai25/eval-view@v0.6.0
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          fail-on: REGRESSION
          strict: 'false'
          diff: 'true'

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evalview-report
          path: evalview-report.html

      - name: Post PR comment
        if: github.event_name == 'pull_request' && always()
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install evalview -q
          evalview ci comment --results ${{ steps.evalview.outputs.results-file }}
```

## Git Hooks Integration

Install pre-push hooks to block regressions locally without any CI configuration.

```bash
# Install pre-push hook
evalview install-hooks

# Now every git push runs regression checks first
git push origin main
# Running evalview check...
# ✓ 5/5 tests passed
# Pushing to origin/main...
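
# Git's standard --no-verify flag skips the pre-push hook if you ever need an
# emergency push without the check (plain git behavior, not an EvalView flag)
git push origin main --no-verify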

# Uninstall hooks
evalview uninstall-hooks
```

## Test Generation

Generate test suites automatically from live agent probing or traffic logs.

```bash
# Generate tests from live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Limit number of tests and focus on specific tools
evalview generate --agent http://localhost:8000 \
  --budget 20 \
  --include-tools search,calendar

# Preview without writing files
evalview generate --dry-run

# Approve generated tests as baselines
evalview snapshot tests/generated --approve-generated
```

## Traffic Capture

Capture real agent interactions and auto-generate test cases with an assertion wizard.

```bash
# Start proxy to capture traffic
evalview capture --agent http://localhost:8000/invoke

# Use your agent normally through the proxy
# Press Ctrl+C to stop

# Capture multi-turn conversations as single tests
evalview capture --agent http://localhost:8000/invoke --multi-turn

# Custom proxy port
evalview capture --agent http://localhost:8000/invoke --port 8092

# Assertion wizard analyzes captured traffic:
# Assertion Wizard — analyzing 8 captured interactions
# Agent type detected: multi-step
# Tools seen: search, extract, summarize
# Suggested assertions:
#   1. Lock tool sequence: search -> extract -> summarize
#   2. Require tools: search, extract, summarize
#   3. Max latency: 5000ms
# Accept all recommended? [Y/n]: y
```

## Statistical Mode

Handle non-deterministic LLM outputs with pass@k reliability metrics.

```bash
# Run each test 10 times
evalview run --runs 10

# Custom pass rate requirement
evalview run --runs 10 --pass-rate 0.7

# Filter by difficulty level
evalview run --difficulty hard --runs 5
```

```yaml
# Per-test statistical configuration
name: "flaky-agent-test"

input:
  query: "Analyze market trends"

expected:
  tools: [fetch_data, analyze]

thresholds:
  min_score: 70

variance:
  runs: 10
  pass_rate: 0.8
  min_mean_score: 70
  max_std_dev: 15
```

## Auto-Variant Discovery

Automatically discover and save valid execution path variants for non-deterministic agents.

```bash
# Run 10 times, cluster paths, save valid variants
evalview check --statistical 10 --auto-variant

# Output:
# search-flow   mean: 82.3, std: 8.1, flakiness: low_variance
#   1. search -> extract -> summarize  (7/10 runs)
#   2. search -> summarize             (3/10 runs)
# Save as golden variant? [Y/n]: y
# Saved variant 'auto-v1': search -> summarize
```

## MCP Contract Testing

Detect when external MCP servers change their interface before your agent breaks.

```bash
# Snapshot MCP server tool definitions
evalview mcp snapshot "npx:@modelcontextprotocol/server-github" --name server-github

# Check for interface drift
evalview mcp check server-github

# Output on drift:
# CONTRACT_DRIFT - 2 breaking change(s)
#   REMOVED: create_pull_request
#   CHANGED: list_issues - new required parameter 'owner'

# List all contracts
evalview mcp list

# Show contract details
evalview mcp show server-github

# Run tests with contract pre-flight check
evalview run tests/ --contracts --fail-on "REGRESSION,CONTRACT_DRIFT"
```

## Claude Code MCP Integration

Use EvalView as an MCP server inside Claude Code for inline regression checks.
```bash
# Add EvalView as MCP server to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# Available MCP tools:
# - create_test: Generate test cases from natural language
# - run_snapshot: Save current behavior as baseline
# - run_check: Check for regressions
# - list_tests: List all test cases
# - validate_skill: Validate skill structure
# - generate_skill_tests: Generate behavior tests
# - run_skill_test: Run skill tests
# - generate_visual_report: Create HTML report

# Copy example CLAUDE.md for proactive checks
cp CLAUDE.md.example CLAUDE.md
```

## Custom Adapter Implementation

Build custom adapters for proprietary agent frameworks.

```python
# evalview/adapters/my_adapter.py
from datetime import datetime
from typing import Any, Dict, Optional

from evalview.adapters.base import AgentAdapter
from evalview.core.types import (
    ExecutionTrace,
    StepTrace,
    StepMetrics,
    ExecutionMetrics
)


class MyCustomAdapter(AgentAdapter):
    """Adapter for custom agent framework."""

    def __init__(
        self,
        endpoint: str,
        headers: Optional[Dict[str, str]] = None,
        timeout: float = 30.0
    ):
        self.endpoint = endpoint
        self.headers = headers or {}
        self.timeout = timeout

    @property
    def name(self) -> str:
        return "my-adapter"

    async def execute(
        self,
        query: str,
        context: Optional[Dict[str, Any]] = None
    ) -> ExecutionTrace:
        import httpx

        start_time = datetime.now()

        async with httpx.AsyncClient(timeout=self.timeout) as client:
            response = await client.post(
                self.endpoint,
                json={"query": query, **(context or {})},
                headers=self.headers
            )
            response.raise_for_status()
            data = response.json()

        steps = [
            StepTrace(
                step_id=step["id"],
                step_name=step["name"],
                tool_name=step.get("tool"),
                parameters=step.get("parameters", {}),
                output=step.get("output"),
                success=step.get("success", True),
                metrics=StepMetrics(
                    latency=step.get("latency", 0.0),
                    cost=step.get("cost", 0.0),
                    tokens=step.get("tokens")
                )
            )
            for step in data.get("steps", [])
        ]

        end_time = datetime.now()

        return ExecutionTrace(
            session_id=data.get("session_id", "custom"),
            start_time=start_time,
            end_time=end_time,
            steps=steps,
            final_output=data.get("output", ""),
            metrics=ExecutionMetrics(
                total_cost=sum(s.metrics.cost for s in steps),
                total_latency=(end_time - start_time).total_seconds() * 1000,
                total_tokens=sum(s.metrics.tokens or 0 for s in steps)
            )
        )

    async def health_check(self) -> bool:
        try:
            import httpx
            async with httpx.AsyncClient(timeout=5.0) as client:
                response = await client.get(self.endpoint)
                return response.status_code == 200
        except Exception:
            return False
```

## LangGraph Adapter Configuration

Configure EvalView to test LangGraph agents with tool call capture.

```yaml
# .evalview/config.yaml
adapter: langgraph
endpoint: http://localhost:2024
assistant_id: agent
timeout: 90
```

```yaml
# tests/langgraph-research.yaml
name: "multi-step-research"
adapter: langgraph
endpoint: http://localhost:2024

input:
  query: "What are the latest AI trends?"
  context:
    assistant_id: agent

expected:
  tools:
    - tavily_search
    - summarize
  tool_sequence:
    - tavily_search
    - summarize
  output:
    contains:
      - "AI"
      - "trends"

thresholds:
  min_score: 70
  max_cost: 0.10
  max_latency: 30000
```

## A/B Endpoint Comparison

Compare the same test suite against two different agent endpoints.
```bash
# Compare two endpoints
evalview compare \
  --v1 http://localhost:8000 \
  --v2 http://localhost:8001 \
  --label-v1 "production" \
  --label-v2 "staging"

# Skip LLM judge for fast comparison
evalview compare --v1 prod --v2 staging --no-judge

# Output:
# ┌─────────────────┬───────────┬───────────┬────────┬───────────┐
# │ Test            │ Prod      │ Staging   │ Δ      │ Verdict   │
# ├─────────────────┼───────────┼───────────┼────────┼───────────┤
# │ search-flow     │ 92.5      │ 94.0      │ +1.5   │ improved  │
# │ refund-flow     │ 88.0      │ 71.0      │ -17.0  │ degraded  │
# └─────────────────┴───────────┴───────────┴────────┴───────────┘
# Overall: CAUTION - investigate degraded tests before promoting
```

## Pytest Plugin

Use EvalView fixtures in standard pytest test suites.

```python
# test_agent.py
import pytest


def test_weather_regression(evalview_check):
    """Test weather agent hasn't regressed."""
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()


def test_booking_flow(evalview_check):
    """Test booking flow maintains expected behavior."""
    diff = evalview_check("booking-flow")
    assert diff.passed, f"Booking flow failed: {diff.summary()}"
    assert diff.tool_changes == 0, "Unexpected tool changes"


# Run with pytest
# pip install evalview   # Plugin registers automatically
# pytest test_agent.py -v
```

## Environment Variables

Configure EvalView via environment variables for CI and production use.

```bash
# LLM API keys for judge evaluation
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export HF_TOKEN="hf_..."

# Force specific judge provider
export EVAL_PROVIDER="openai"   # openai, anthropic, huggingface, ollama
export EVAL_MODEL="gpt-5"

# Slack webhook for monitoring alerts
export EVALVIEW_SLACK_WEBHOOK="https://hooks.slack.com/services/..."

# Agent endpoint
export AGENT_ENDPOINT="http://localhost:8000"

# Skills testing provider
export SKILL_TEST_PROVIDER="anthropic"
export SKILL_TEST_API_KEY="sk-..."
```

EvalView gives teams building production AI agents confidence that prompt changes, model updates, and tool modifications don't silently break existing behavior. It integrates into any workflow through CLI commands, the Python API, CI/CD pipelines, or MCP servers, and it excels at catching the kinds of regressions traditional tests miss: tool-choice changes, output quality degradation, cost spikes, and latency increases.

Common integration patterns include `evalview check` in CI to block PRs with regressions, `evalview watch` during development for instant feedback, `evalview monitor` in production for continuous validation with Slack alerts, and the Python `gate()` API for programmatic regression gates in autonomous agent loops. The golden baseline system makes it easy to capture known-good behavior and detect drift over time, while statistical mode and auto-variant discovery handle the inherent non-determinism of LLM-based agents.
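To tie these pieces together, a first session with EvalView typically looks like the minimal sketch below. It uses only commands documented above; which `--fail-on` statuses you treat as blocking is up to your team.

```bash
# Detect the running agent and scaffold config plus example tests
evalview init

# Freeze the currently passing behavior as the golden baseline
evalview snapshot

# After changing a prompt, model, or tool, diff against the baseline
evalview check --fail-on REGRESSION,TOOLS_CHANGED

# Share the results as an HTML report
evalview run --html-report report.html
```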