# EvalView

EvalView is an open-source behavioral regression detection framework for AI agents. It works like Playwright, but for tool-calling and multi-turn AI agents, catching silent regressions when prompts, models, or tools change. The framework tracks drift across outputs, tools, model IDs, and runtime fingerprints, distinguishing provider changes from actual system regressions. EvalView provides golden baseline diffing, auto-healing capabilities, and seamless CI/CD integration.

EvalView supports multiple agent frameworks, including LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API. Each run is scored across six dimensions: hard-fail safety gates, tool accuracy, output quality (LLM-as-judge), sequence correctness, cost thresholds, and latency thresholds. Tests are defined in YAML and can be run via the CLI or the Python API.

## Installation

Install EvalView with pip to enable AI agent regression testing.

```bash
# Basic installation
pip install evalview

# With interactive HTML report charts
pip install evalview[reports]

# With watch mode for live development
pip install evalview[watch]

# All optional features
pip install evalview[all]

# Verify installation
evalview --version
```

## Initialize Project

Auto-detect your running agent and create a starter test suite with configuration.

```bash
# Initialize EvalView in current directory
evalview init

# Output:
# Created .evalview/config.yaml
# Created tests/test-cases/example.yaml
# Detected agent at http://localhost:8000

# Override auto-detected agent profile
evalview init --profile rag

# Available profiles: chat, tool-use, multi-step, rag, coding
```

## Configuration File

Configure the adapter, endpoint, and scoring weights in `.evalview/config.yaml`.

```yaml
# .evalview/config.yaml
adapter: http
endpoint: http://localhost:8000/api/chat
timeout: 30.0

# Custom HTTP headers
headers:
  Authorization: Bearer ${API_KEY}
  Content-Type: application/json

# Scoring weights (must sum to 1.0)
scoring:
  weights:
    tool_accuracy: 0.3
    output_quality: 0.5
    sequence_correctness: 0.2

# Retry configuration for flaky tests
retry:
  max_retries: 2
  base_delay: 1.0
```

## Write Test Cases

Define test cases in YAML with expected tools, output assertions, and thresholds.

```yaml
# tests/test-cases/weather-query.yaml
name: "weather-query-test"
description: "Test that the agent fetches weather information correctly"

input:
  query: "What's the weather in San Francisco?"
  context:
    user_id: "test-user-123"

expected:
  tools:
    - get_weather
    - format_response
  tool_sequence:
    - get_weather
    - format_response
  forbidden_tools:
    - delete_user
    - admin_override
  output:
    contains:
      - "San Francisco"
      - "temperature"
    not_contains:
      - "error"
      - "I don't know"
    regex_patterns:
      - "\\d+°[CF]"
    json_schema:
      type: object
      required:
        - location
        - temperature

thresholds:
  min_score: 80
  max_cost: 0.05
  max_latency: 5000
```

## Multi-Turn Conversation Tests

Test multi-step conversations with per-turn assertions and context tracking.

```yaml
# tests/test-cases/booking-flow.yaml
name: "flight-booking-flow"
description: "Test multi-turn flight booking conversation"

turns:
  - query: "I want to book a flight to Paris"
    expected:
      tools: [search_flights]
      output:
        contains: ["Paris", "available flights"]

  - query: "Book the cheapest option"
    expected:
      tools: [book_flight, confirm_booking]
      forbidden_tools: [delete_booking, refund]
      output:
        contains: ["confirmation", "booked"]
        not_contains: ["error", "failed"]

  - query: "What's my booking status?"
    expected:
      tools: [get_booking_status]
      output:
        contains: ["confirmed", "Paris"]

thresholds:
  min_score: 75
  max_cost: 0.25
  max_latency: 15000
```

## Run Tests

Execute test cases with various options for parallel execution, retries, and output formats.

```bash
# Run all tests (parallel by default)
evalview run

# Run specific tests by name
evalview run -t "weather-query" -t "booking-flow"

# Run with pattern matching
evalview run --pattern "tests/api/*.yaml"

# Sequential execution with retries
evalview run --sequential --max-retries 3

# Generate HTML report
evalview run --html-report report.html

# Verbose output with coverage info
evalview run --verbose --coverage

# Dry run to preview plan and estimate cost
evalview run --dry-run

# Set budget limit
evalview run --budget 1.00

# Skip LLM judge for free deterministic-only scoring
evalview run --no-judge
```

## Golden Baseline Snapshots

Save current behavior as a baseline and detect regressions in future runs.

```bash
# Save all passing tests as golden baseline
evalview snapshot

# Snapshot specific test
evalview snapshot --test "weather-query"

# Save alternate valid behavior as variant
evalview snapshot --variant v2 --notes "Post-refactor baseline"

# Preview what would change without saving
evalview snapshot --preview

# List all saved baselines
evalview snapshot list

# Show details of a specific baseline
evalview snapshot show "weather-query"

# Delete a baseline
evalview snapshot delete "weather-query" --force
```

## Check for Regressions

Compare current behavior against saved baselines to detect regressions.

```bash
# Check all tests against baselines
evalview check

# Check specific test
evalview check --test "weather-query"

# JSON output for CI integration
evalview check --json

# Configure which statuses fail the check
evalview check --fail-on REGRESSION
evalview check --fail-on REGRESSION,TOOLS_CHANGED

# Strictest mode - fail on any change
evalview check --strict

# Enable semantic similarity comparison
evalview check --semantic-diff

# Auto-heal flaky failures with retries
evalview check --heal

# Set budget limit
evalview check --budget 0.50
```

## Watch Mode

Automatically re-run regression checks on file changes during development.

```bash
# Watch current directory
evalview watch

# Quick mode - no LLM judge, $0, sub-second
evalview watch --quick

# Watch specific test only
evalview watch --test "refund-flow"

# Output example:
# ╭─────────────────── EvalView Watch ───────────────────╮
# │ Watching   .                                          │
# │ Tests      all in tests/                              │
# │ Mode       quick (no judge, $0)                       │
# ╰───────────────────────────────────────────────────────╯
#
# 14:32:07  Change detected: src/agent.py
# ████████████████████░░░░  4 passed · 1 tools changed
```

## Production Monitoring

Continuously monitor your agent in production with Slack alerts.

```bash
# Monitor every 5 minutes (default)
evalview monitor

# Live terminal dashboard
evalview monitor --dashboard

# With Slack webhook for alerts
evalview monitor --slack-webhook https://hooks.slack.com/services/...

# Export history for dashboards
evalview monitor --history monitor.jsonl

# Custom interval (in seconds)
evalview monitor --interval 300
```

## Python API - gate()

Use EvalView programmatically, without the CLI, to integrate regression checks into agent workflows.
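For example, a pre-merge or deploy script can turn the gate result into a process exit code. The sketch below is illustrative: the `sys.exit` wiring is ours, while `gate()` and `result.passed` are the pieces of the API shown in the full example that follows.

```python
import sys

from evalview import gate

# Run every test in tests/ against its golden baseline and fail the
# calling script (exit code 1) if any comparison regresses.
result = gate(test_dir="tests/")
sys.exit(0 if result.passed else 1)
```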
```python
from evalview import gate, gate_async, DiffStatus

# Synchronous regression check
result = gate(test_dir="tests/")

# Check if all tests passed
if result.passed:
    print("All tests passed!")
else:
    print(f"Status: {result.status.value}")
    print(f"Regressions: {result.summary.regressions}")
    print(f"Tools changed: {result.summary.tools_changed}")

# Iterate through individual test results
for diff in result.diffs:
    print(f"{diff.test_name}: {diff.status.value}")
    print(f"  Score delta: {diff.score_delta}")
    print(f"  Tool changes: {diff.tool_changes}")
    if diff.output_similarity:
        print(f"  Output similarity: {diff.output_similarity:.1%}")

# Quick mode - no LLM judge, $0, sub-second
result = gate(test_dir="tests/", quick=True)

# Custom fail conditions
result = gate(
    test_dir="tests/",
    fail_on={DiffStatus.REGRESSION, DiffStatus.TOOLS_CHANGED},
    judge_model="gpt-5",
    semantic_diff=True,
    timeout=60.0
)

# Async variant for agent frameworks
async def check_regressions():
    result = await gate_async(test_dir="tests/", quick=True)
    return result.passed
```

## Python API - gate_or_revert()

Auto-revert code changes when a regression is detected in autonomous loops.

```python
from evalview.openclaw import gate_or_revert

# Make a code change
make_code_change()

# Check and auto-revert if regression detected
if not gate_or_revert("tests/", quick=True):
    # Change was reverted - try a different approach
    print("Regression detected, change reverted")
    try_alternative_approach()
else:
    print("Change passed regression check")
    commit_changes()
```

## LLM-as-Judge Configuration

Configure which LLM provider and model to use for output quality evaluation.

```bash
# Use OpenAI GPT
evalview run --judge-model gpt-5 --judge-provider openai

# Use Anthropic Claude
evalview run --judge-model sonnet --judge-provider anthropic

# Use HuggingFace (free)
evalview run --judge-model llama-70b --judge-provider huggingface

# Use local Ollama (free)
evalview run --judge-model llama3.2 --judge-provider ollama

# Use DeepSeek
evalview run --judge-model deepseek-chat --judge-provider deepseek

# Skip LLM judge entirely (deterministic scoring only)
evalview run --no-judge
```

```yaml
# Model shortcuts in config
# gpt-5     -> gpt-5
# sonnet    -> claude-sonnet-4-5-20250929
# opus      -> claude-opus-4-5-20251101
# llama-70b -> meta-llama/Llama-3.1-70B-Instruct
# gemini    -> gemini-3.0
```

## GitHub Actions Integration

Add regression checks to your CI/CD pipeline with automatic PR comments and artifacts.

```yaml
# .github/workflows/evalview.yml
name: EvalView Agent Check

on: [pull_request, push]

jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Check for agent regressions
        id: evalview
        uses: hidai25/eval-view@v0.6.0
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          fail-on: REGRESSION
          strict: 'false'
          diff: 'true'

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evalview-report
          path: evalview-report.html

      - name: Post PR comment
        if: github.event_name == 'pull_request' && always()
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install evalview -q
          evalview ci comment --results ${{ steps.evalview.outputs.results-file }}
```

## Git Hooks Integration

Install pre-push hooks to block regressions locally without any CI configuration.

```bash
# Install pre-push hook
evalview install-hooks

# Now every git push runs regression checks first
git push origin main
# Running evalview check...
# ✓ 5/5 tests passed
# Pushing to origin/main...
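
# Git's standard --no-verify flag skips the pre-push hook if you ever need an
# emergency push without the check (plain git behavior, not an EvalView flag)
git push origin main --no-verify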

# Uninstall hooks
evalview uninstall-hooks
```

## Test Generation

Generate test suites automatically from live agent probing or traffic logs.

```bash
# Generate tests from live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Limit number of tests and focus on specific tools
evalview generate --agent http://localhost:8000 \
  --budget 20 \
  --include-tools search,calendar

# Preview without writing files
evalview generate --dry-run

# Approve generated tests as baselines
evalview snapshot tests/generated --approve-generated
```

## Traffic Capture

Capture real agent interactions and auto-generate test cases with an assertion wizard.

```bash
# Start proxy to capture traffic
evalview capture --agent http://localhost:8000/invoke

# Use your agent normally through the proxy
# Press Ctrl+C to stop

# Capture multi-turn conversations as single tests
evalview capture --agent http://localhost:8000/invoke --multi-turn

# Custom proxy port
evalview capture --agent http://localhost:8000/invoke --port 8092

# Assertion wizard analyzes captured traffic:
# Assertion Wizard — analyzing 8 captured interactions
# Agent type detected: multi-step
# Tools seen: search, extract, summarize
# Suggested assertions:
#   1. Lock tool sequence: search -> extract -> summarize
#   2. Require tools: search, extract, summarize
#   3. Max latency: 5000ms
# Accept all recommended? [Y/n]: y
```

## Statistical Mode

Handle non-deterministic LLM outputs with pass@k reliability metrics.

```bash
# Run each test 10 times
evalview run --runs 10

# Custom pass rate requirement
evalview run --runs 10 --pass-rate 0.7

# Filter by difficulty level
evalview run --difficulty hard --runs 5
```

```yaml
# Per-test statistical configuration
name: "flaky-agent-test"

input:
  query: "Analyze market trends"

expected:
  tools: [fetch_data, analyze]

thresholds:
  min_score: 70

variance:
  runs: 10
  pass_rate: 0.8
  min_mean_score: 70
  max_std_dev: 15
```

## Auto-Variant Discovery

Automatically discover and save valid execution path variants for non-deterministic agents.

```bash
# Run 10 times, cluster paths, save valid variants
evalview check --statistical 10 --auto-variant

# Output:
# search-flow   mean: 82.3, std: 8.1, flakiness: low_variance
#   1. search -> extract -> summarize  (7/10 runs)
#   2. search -> summarize             (3/10 runs)
# Save as golden variant? [Y/n]: y
# Saved variant 'auto-v1': search -> summarize
```

## MCP Contract Testing

Detect when external MCP servers change their interface before your agent breaks.

```bash
# Snapshot MCP server tool definitions
evalview mcp snapshot "npx:@modelcontextprotocol/server-github" --name server-github

# Check for interface drift
evalview mcp check server-github

# Output on drift:
# CONTRACT_DRIFT - 2 breaking change(s)
#   REMOVED: create_pull_request
#   CHANGED: list_issues - new required parameter 'owner'

# List all contracts
evalview mcp list

# Show contract details
evalview mcp show server-github

# Run tests with contract pre-flight check
evalview run tests/ --contracts --fail-on "REGRESSION,CONTRACT_DRIFT"
```

## Claude Code MCP Integration

Use EvalView as an MCP server inside Claude Code for inline regression checks.
```bash
# Add EvalView as MCP server to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# Available MCP tools:
# - create_test: Generate test cases from natural language
# - run_snapshot: Save current behavior as baseline
# - run_check: Check for regressions
# - list_tests: List all test cases
# - validate_skill: Validate skill structure
# - generate_skill_tests: Generate behavior tests
# - run_skill_test: Run skill tests
# - generate_visual_report: Create HTML report

# Copy example CLAUDE.md for proactive checks
cp CLAUDE.md.example CLAUDE.md
```

## Custom Adapter Implementation

Build custom adapters for proprietary agent frameworks.

```python
# evalview/adapters/my_adapter.py
from datetime import datetime
from typing import Any, Dict, Optional

from evalview.adapters.base import AgentAdapter
from evalview.core.types import (
    ExecutionTrace,
    StepTrace,
    StepMetrics,
    ExecutionMetrics
)


class MyCustomAdapter(AgentAdapter):
    """Adapter for custom agent framework."""

    def __init__(
        self,
        endpoint: str,
        headers: Optional[Dict[str, str]] = None,
        timeout: float = 30.0
    ):
        self.endpoint = endpoint
        self.headers = headers or {}
        self.timeout = timeout

    @property
    def name(self) -> str:
        return "my-adapter"

    async def execute(
        self,
        query: str,
        context: Optional[Dict[str, Any]] = None
    ) -> ExecutionTrace:
        import httpx

        start_time = datetime.now()

        async with httpx.AsyncClient(timeout=self.timeout) as client:
            response = await client.post(
                self.endpoint,
                json={"query": query, **(context or {})},
                headers=self.headers
            )
            response.raise_for_status()
            data = response.json()

        steps = [
            StepTrace(
                step_id=step["id"],
                step_name=step["name"],
                tool_name=step.get("tool"),
                parameters=step.get("parameters", {}),
                output=step.get("output"),
                success=step.get("success", True),
                metrics=StepMetrics(
                    latency=step.get("latency", 0.0),
                    cost=step.get("cost", 0.0),
                    tokens=step.get("tokens")
                )
            )
            for step in data.get("steps", [])
        ]

        end_time = datetime.now()

        return ExecutionTrace(
            session_id=data.get("session_id", "custom"),
            start_time=start_time,
            end_time=end_time,
            steps=steps,
            final_output=data.get("output", ""),
            metrics=ExecutionMetrics(
                total_cost=sum(s.metrics.cost for s in steps),
                total_latency=(end_time - start_time).total_seconds() * 1000,
                total_tokens=sum(s.metrics.tokens or 0 for s in steps)
            )
        )

    async def health_check(self) -> bool:
        try:
            import httpx
            async with httpx.AsyncClient(timeout=5.0) as client:
                response = await client.get(self.endpoint)
                return response.status_code == 200
        except Exception:
            return False
```

## LangGraph Adapter Configuration

Configure EvalView to test LangGraph agents with tool call capture.

```yaml
# .evalview/config.yaml
adapter: langgraph
endpoint: http://localhost:2024
assistant_id: agent
timeout: 90
```

```yaml
# tests/langgraph-research.yaml
name: "multi-step-research"
adapter: langgraph
endpoint: http://localhost:2024

input:
  query: "What are the latest AI trends?"
  context:
    assistant_id: agent

expected:
  tools:
    - tavily_search
    - summarize
  tool_sequence:
    - tavily_search
    - summarize
  output:
    contains:
      - "AI"
      - "trends"

thresholds:
  min_score: 70
  max_cost: 0.10
  max_latency: 30000
```

## A/B Endpoint Comparison

Compare the same test suite against two different agent endpoints.
```bash
# Compare two endpoints
evalview compare \
  --v1 http://localhost:8000 \
  --v2 http://localhost:8001 \
  --label-v1 "production" \
  --label-v2 "staging"

# Skip LLM judge for fast comparison
evalview compare --v1 prod --v2 staging --no-judge

# Output:
# ┌─────────────────┬───────────┬───────────┬────────┬───────────┐
# │ Test            │ Prod      │ Staging   │ Δ      │ Verdict   │
# ├─────────────────┼───────────┼───────────┼────────┼───────────┤
# │ search-flow     │ 92.5      │ 94.0      │ +1.5   │ improved  │
# │ refund-flow     │ 88.0      │ 71.0      │ -17.0  │ degraded  │
# └─────────────────┴───────────┴───────────┴────────┴───────────┘
# Overall: CAUTION - investigate degraded tests before promoting
```

## Pytest Plugin

Use EvalView fixtures in standard pytest test suites.

```python
# test_agent.py
import pytest


def test_weather_regression(evalview_check):
    """Test weather agent hasn't regressed."""
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()


def test_booking_flow(evalview_check):
    """Test booking flow maintains expected behavior."""
    diff = evalview_check("booking-flow")
    assert diff.passed, f"Booking flow failed: {diff.summary()}"
    assert diff.tool_changes == 0, "Unexpected tool changes"


# Run with pytest
# pip install evalview   # Plugin registers automatically
# pytest test_agent.py -v
```

## Environment Variables

Configure EvalView via environment variables for CI and production use.

```bash
# LLM API keys for judge evaluation
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export HF_TOKEN="hf_..."

# Force specific judge provider
export EVAL_PROVIDER="openai"   # openai, anthropic, huggingface, ollama
export EVAL_MODEL="gpt-5"

# Slack webhook for monitoring alerts
export EVALVIEW_SLACK_WEBHOOK="https://hooks.slack.com/services/..."

# Agent endpoint
export AGENT_ENDPOINT="http://localhost:8000"

# Skills testing provider
export SKILL_TEST_PROVIDER="anthropic"
export SKILL_TEST_API_KEY="sk-..."
```

EvalView gives teams building production AI agents confidence that prompt changes, model updates, and tool modifications don't silently break existing behavior. It integrates into any workflow through CLI commands, the Python API, CI/CD pipelines, or MCP servers, and it excels at catching the kinds of regressions traditional tests miss: tool-choice changes, output quality degradation, cost spikes, and latency increases.

Common integration patterns include `evalview check` in CI to block PRs with regressions, `evalview watch` during development for instant feedback, `evalview monitor` in production for continuous validation with Slack alerts, and the Python `gate()` API for programmatic regression gates in autonomous agent loops. The golden baseline system makes it easy to capture known-good behavior and detect drift over time, while statistical mode and auto-variant discovery handle the inherent non-determinism of LLM-based agents.
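To tie these pieces together, a first session with EvalView typically looks like the minimal sketch below. It uses only commands documented above; which `--fail-on` statuses you treat as blocking is up to your team.

```bash
# Detect the running agent and scaffold config plus example tests
evalview init

# Freeze the currently passing behavior as the golden baseline
evalview snapshot

# After changing a prompt, model, or tool, diff against the baseline
evalview check --fail-on REGRESSION,TOOLS_CHANGED

# Share the results as an HTML report
evalview run --html-report report.html
```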