https://github.com/incidentfox/incidentfox

# IncidentFox - AI SRE Platform

IncidentFox is an open-source AI SRE platform that automatically investigates production incidents, correlates alerts, analyzes logs, and finds root causes. The platform connects to your observability stack (Datadog, Grafana, Coralogix, Elasticsearch, etc.), infrastructure (Kubernetes, AWS, GCP, Azure), and code repositories to reason through incidents and provide actionable insights. It integrates with Slack, Microsoft Teams, and Google Chat, enabling teams to investigate incidents without leaving their communication tools.

The platform follows a multi-tenant architecture with hierarchical org/team configuration, sandboxed agent execution using gVisor for security isolation, and a credential proxy pattern that ensures secrets never touch the AI agent. Key components include the SRE Agent (Claude SDK-based AI agent with 45+ skills), Config Service (hierarchical configuration with deep merge), Orchestrator (webhook routing and multi-channel output), Slack Bot (Bolt SDK integration), and Web UI (Next.js admin console and agent runner).
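
The credential proxy pattern mentioned above can be illustrated with a minimal sketch. This is not IncidentFox's implementation: the `PROXY_SECRETS` table, the `inject_credentials` function, and the secret values are hypothetical names chosen for illustration. The idea shown is only that the sandboxed agent sends requests without secrets and a separate proxy process adds them based on the destination host.

```python
from urllib.parse import urlparse

# Hypothetical secret store held only by the proxy process, keyed by upstream host.
PROXY_SECRETS = {
    "api.datadoghq.com": {"DD-API-KEY": "real-datadog-key"},
    "api.github.com": {"Authorization": "Bearer real-github-token"},
}


def inject_credentials(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Return the headers the proxy would forward upstream.

    The agent's request carries no secrets; the proxy looks up the
    destination host and adds the matching credential headers.
    """
    host = urlparse(url).hostname or ""
    if host not in PROXY_SECRETS:
        raise PermissionError(f"no credentials configured for host: {host}")
    forwarded = dict(headers)  # copy so the agent-side headers are never mutated
    forwarded.update(PROXY_SECRETS[host])
    return forwarded


agent_headers = {"Accept": "application/json"}
upstream = inject_credentials(
    "https://api.github.com/repos/mycompany/payment-service", agent_headers
)
print(sorted(upstream))   # header names sent upstream
print(agent_headers)      # the agent's own headers still contain no secret
```

Rejecting unknown hosts, as the sketch does, is one possible design choice: it keeps the agent from using the proxy to reach arbitrary destinations.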

## Config Service API

The Config Service provides hierarchical organization and team configuration with deep merge semantics, token management, audit logging, and RBAC.

### Get Effective Team Configuration

Retrieves the effective configuration for the authenticated team, computed by merging each level of the hierarchy in order: defaults, then org, then team.

```bash
# Get effective configuration for the current team
curl -X GET "https://config-service.example.com/api/v1/config/me" \
  -H "Authorization: Bearer <team_token>" \
  -H "Content-Type: application/json"

# Response:
{
  "node_id": "team-abc123",
  "effective_config": {
    "entrance_agent": "planner",
    "agents": {
      "planner": {
        "enabled": true,
        "model": "claude-sonnet-4-20250514",
        "system_prompt": "You are an SRE AI assistant..."
      }
    },
    "integrations": {
      "datadog": {"enabled": true, "site": "us5.datadoghq.com"},
      "github": {"enabled": true, "default_org": "mycompany"}
    }
  },
  "computed_at": "2025-01-27T10:30:00Z",
  "hierarchy": ["org-root", "group-sre", "team-abc123"]
}
```

### Update Team Configuration

Updates the team's configuration using a deep merge: dictionaries merge recursively, while lists are replaced entirely.

```bash
# Update team configuration (deep merge)
curl -X PATCH "https://config-service.example.com/api/v1/config/me" \
  -H "Authorization: Bearer <team_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "integrations": {
        "slack": {
          "auto_investigate_channels": ["#incidents", "#alerts"]
        }
      },
      "agents": {
        "planner": {
          "max_turns": 50
        }
      }
    },
    "reason": "Enable auto-investigation in incident channels"
  }'

# Response:
{
  "node_id": "team-abc123",
  "node_type": "team",
  "config": {
    "integrations": {
      "slack": {
        "auto_investigate_channels": ["#incidents", "#alerts"]
      }
    },
    "agents": {"planner": {"max_turns": 50}}
  },
  "version": 5,
  "updated_at": "2025-01-27T10:35:00Z",
  "updated_by": "team"
}
```
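
The merge rule described in this section (dictionaries merge recursively, lists replace entirely) can be sketched in a few lines of Python. The function names `deep_merge` and `effective_config` and the sample layers are assumptions made for illustration; the Config Service's own code may differ.

```python
from copy import deepcopy
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base: dicts recurse, everything else replaces."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)  # lists and scalars replace entirely
    return merged


def effective_config(layers: list[dict]) -> dict:
    """Fold layers from least to most specific (defaults, org, team)."""
    return reduce(deep_merge, layers, {})


defaults = {"agents": {"planner": {"enabled": True, "max_turns": 20}}}
org = {"integrations": {"slack": {"auto_investigate_channels": ["#ops"]}}}
team = {
    "integrations": {"slack": {"auto_investigate_channels": ["#incidents", "#alerts"]}},
    "agents": {"planner": {"max_turns": 50}},
}

result = effective_config([defaults, org, team])
print(result["agents"]["planner"])
# {'enabled': True, 'max_turns': 50}
print(result["integrations"]["slack"]["auto_investigate_channels"])
# ['#incidents', '#alerts']
```

The planner keeps `enabled` from the defaults while `max_turns` is overridden, and the org-level channel list is replaced rather than appended to.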

### Admin: Get Node Configuration Hierarchy

Admin endpoints to retrieve the raw or effective (merged) configuration for any node in the hierarchy, and to validate a node's configuration.

```bash
# Get raw config for a specific node (admin only)
curl -X GET "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/raw" \
  -H "Authorization: Bearer <admin_token>"

# Get effective (merged) config for a node
curl -X GET "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/effective" \
  -H "Authorization: Bearer <admin_token>"

# Validate node configuration
curl -X POST "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/validate" \
  -H "Authorization: Bearer <admin_token>"

# Response (validate):
{
  "node_id": "team-abc",
  "valid": true,
  "missing_required": [],
  "errors": []
}
```
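
As a rough illustration of what a validation pass might check, the sketch below looks up a list of required dotted paths in a config and reports the missing ones in the same shape as the response above. The helper names `get_path` and `validate_node`, and the idea that required fields are expressed as dotted paths, are assumptions rather than documented behavior.

```python
def get_path(config: dict, dotted: str):
    """Return the value at a dotted path such as 'integrations.datadog.site', or None."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def validate_node(node_id: str, config: dict, required: list[str]) -> dict:
    """Build a result shaped like the validate endpoint's example response."""
    missing = [path for path in required if get_path(config, path) in (None, "")]
    return {
        "node_id": node_id,
        "valid": not missing,
        "missing_required": missing,
        "errors": [],
    }


config = {"integrations": {"datadog": {"enabled": True, "site": "us5.datadoghq.com"}}}
required = ["integrations.datadog.site", "integrations.github.default_org"]  # hypothetical list

report = validate_node("team-abc", config, required)
print(report["valid"], report["missing_required"])
# False ['integrations.github.default_org']
```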

## SRE Agent Investigation API

The SRE Agent server manages AI investigations in isolated Kubernetes sandboxes with SSE streaming.

### Run Investigation

Starts an AI investigation session with SSE streaming for real-time progress updates.

```bash
# Start a new investigation (streaming SSE response)
curl -X POST "https://sre-agent.example.com/investigate" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Investigate the high error rate on the payment service",
    "tenant_id": "slack-T123ABC",
    "team_id": "team-payments",
    "team_token": "<team_bearer_token>"
  }'

# SSE Response stream:
data: {"type": "thinking", "content": "Starting investigation of payment service errors..."}

data: {"type": "tool_use", "name": "datadog_get_statistics", "input": {"service": "payment"}}

data: {"type": "tool_result", "name": "datadog_get_statistics", "output": {"error_rate": 15.2, "top_errors": [...]}}

data: {"type": "thinking", "content": "Error rate is elevated at 15.2%. Checking Kubernetes pods..."}

data: {"type": "result", "content": "## Root Cause Analysis\n\nThe payment service is experiencing OOMKilled events..."}
```
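
A client consuming this stream needs to split it into `data:` lines and decode each JSON payload. The following sketch assumes one `data:` line per event, as in the sample stream above, and buffers bytes because a network chunk can end in the middle of a line. The function name `iter_sse_events` and the simulated chunks are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON payloads from the `data:` lines of an SSE byte stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; the remainder waits for the next chunk.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").rstrip("\r")
            if text.startswith("data:"):
                yield json.loads(text[len("data:"):].strip())


# Simulated network chunks; the second event is split across two chunks.
chunks = [
    b'data: {"type": "thinking", "content": "Starting..."}\n\n',
    b'data: {"type": "tool_use", "name": "datadog_get_',
    b'statistics", "input": {"service": "payment"}}\n\n',
    b'data: {"type": "result", "content": "## Root Cause Analysis"}\n\n',
]

events = list(iter_sse_events(chunks))
print([event["type"] for event in events])
# ['thinking', 'tool_use', 'result']
```

Any iterable of byte chunks works as input, for example the chunks produced by an HTTP client that supports streaming responses.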

### Follow-up Investigation (Session Continuity)

Continue an investigation in the same sandbox session by passing the `thread_id` of the existing thread.

```bash
# Follow-up in existing thread (reuses sandbox)
curl -X POST "https://sre-agent.example.com/investigate" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What are the memory limits on this pod?",
    "thread_id": "thread-abc123",
    "tenant_id": "slack-T123ABC",
    "team_id": "team-payments"
  }'
```
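
One way to picture session continuity is a registry that maps each `thread_id` to a sandbox, so a follow-up request lands in the session that already holds the earlier context. The sketch below is a conceptual model only: `Sandbox`, `SessionRegistry`, and the generated identifiers are hypothetical, and a real sandbox would be a Kubernetes resource rather than an in-memory object.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sandbox:
    """Stand-in for an isolated sandbox session."""
    sandbox_id: str
    history: list[str] = field(default_factory=list)


class SessionRegistry:
    """Maps thread_id to a sandbox so follow-ups reuse the same session."""

    def __init__(self) -> None:
        self._by_thread: dict[str, Sandbox] = {}

    def get_or_create(self, thread_id: Optional[str]) -> tuple[str, Sandbox]:
        # No thread_id means a fresh investigation: mint a new thread id.
        if thread_id is None:
            thread_id = f"thread-{uuid.uuid4().hex[:8]}"
        if thread_id not in self._by_thread:
            self._by_thread[thread_id] = Sandbox(sandbox_id=f"sbx-{uuid.uuid4().hex[:8]}")
        return thread_id, self._by_thread[thread_id]


registry = SessionRegistry()
thread_id, first = registry.get_or_create(None)
first.history.append("Investigate the high error rate on the payment service")

_, second = registry.get_or_create(thread_id)  # follow-up with the same thread_id
second.history.append("What are the memory limits on this pod?")

print(first is second, len(second.history))
# True 2
```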

### Interrupt Investigation

Stop a long-running investigation gracefully.

```bash
# Interrupt current execution
curl -X POST "https://sre-agent.example.com/interrupt" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "thread_id": "thread-abc123"
  }'
```

## Orchestrator Webhook API

The Orchestrator routes external webhooks to the appropriate team's AI agent for automated investigation.

### GitHub Webhook

Handles GitHub push, pull_request, issues, and issue_comment events.

```bash
# GitHub webhook (auto-triggered by GitHub)
curl -X POST "https://orchestrator.example.com/webhooks/github" \
  -H "X-GitHub-Event: pull_request" \
  -H "X-GitHub-Delivery: abc123" \
  -H "X-Hub-Signature-256: sha256=<hmac_signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "opened",
    "pull_request": {
      "number": 42,
      "title": "Fix memory leak in payment service",
      "user": {"login": "developer"}
    },
    "repository": {"full_name": "mycompany/payment-service"}
  }'

# Response (async processing):
{"ok": true}
```
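
GitHub computes the `X-Hub-Signature-256` header as an HMAC-SHA256 of the raw request body using the webhook secret, hex-encoded and prefixed with `sha256=`. A receiver can check it as sketched below; the function name `verify_github_signature` and the sample secret are illustrative, and how the Orchestrator performs this check internally is not shown in this document.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header against the raw request body."""
    if not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_header.encode())


secret = "webhook-secret"  # example value only
body = b'{"action": "opened", "pull_request": {"number": 42}}'
header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(verify_github_signature(secret, body, header))
# True
print(verify_github_signature(secret, body + b" ", header))
# False
```

The check must run on the exact bytes received, before any JSON parsing or re-serialization, because whitespace changes alter the digest.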

### PagerDuty Webhook

Handles PagerDuty incident events with optional alert correlation.

```bash
# PagerDuty webhook (v3 format)
curl -X POST "https://orchestrator.example.com/webhooks/pagerduty" \
  -H "X-PagerDuty-Signature: <signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "event": {
        "event_type": "incident.triggered",
        "data": {
          "id": "P123ABC",
          "title": "High CPU on payment-service",
          "urgency": "high",
          "service": {"id": "PSVC123"}
        }
      }
    }]
  }'
```
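
To show how a handler might read the payload above, the sketch below extracts the fields of each `incident.triggered` event. It follows the payload shape in the curl example; the function name `triggered_incidents` and the output keys are assumptions, and the Orchestrator's real handler may extract more or different fields.

```python
def triggered_incidents(payload: dict) -> list[dict]:
    """Pull id, title, urgency, and service id from incident.triggered events."""
    incidents = []
    for message in payload.get("messages", []):
        event = message.get("event", {})
        if event.get("event_type") != "incident.triggered":
            continue  # ignore other lifecycle events in this sketch
        data = event.get("data", {})
        incidents.append({
            "incident_id": data.get("id"),
            "title": data.get("title"),
            "urgency": data.get("urgency"),
            "service_id": data.get("service", {}).get("id"),
        })
    return incidents


payload = {
    "messages": [{
        "event": {
            "event_type": "incident.triggered",
            "data": {
                "id": "P123ABC",
                "title": "High CPU on payment-service",
                "urgency": "high",
                "service": {"id": "PSVC123"},
            },
        }
    }]
}

incidents = triggered_incidents(payload)
print(incidents[0]["incident_id"], incidents[0]["service_id"])
# P123ABC PSVC123
```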

### Incident.io Webhook

Handles Incident.io incident lifecycle and public alert events.

```bash
# Incident.io webhook (Standard Webhooks format)
curl -X POST "https://orchestrator.example.com/webhooks/incidentio" \
  -H "webhook-id: msg_123" \
  -H "webhook-timestamp: 1706353200" \
  -H "webhook-signature: v1,<base64_signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "event_type": "public_alert.alert_created_v1",
    "public_alert.alert_created_v1": {
      "alert_source_id": "01KEGMSPPCKFPYHT2ZSNQ7WY3J",
      "title": "Database connection pool exhausted",
      "status": "firing",
      "priority": "high"
    }
  }'
```
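
In the Standard Webhooks format, the signature covers the string `{webhook-id}.{webhook-timestamp}.{body}`, is computed with HMAC-SHA256 using a base64-encoded secret (commonly prefixed with `whsec_`), and is sent base64-encoded as `v1,<signature>`. The sketch below verifies such a header. The function names, the demo secret, and the 300-second tolerance are illustrative assumptions.

```python
import base64
import hashlib
import hmac
import time


def sign(secret: str, msg_id: str, timestamp: str, body: bytes) -> bytes:
    """Compute the base64 signature over '{id}.{timestamp}.{body}'."""
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{msg_id}.{timestamp}.".encode() + body
    return base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())


def verify_standard_webhook(secret, msg_id, timestamp, body, signature_header,
                            tolerance_s=300, now=None) -> bool:
    """Verify a Standard Webhooks signature header and reject stale timestamps."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    if abs(now - ts) > tolerance_s:
        return False  # guards against replay of old deliveries
    expected = sign(secret, msg_id, timestamp, body)
    # The header may carry several space-separated signatures.
    for candidate in signature_header.split(" "):
        version, _, sig = candidate.partition(",")
        if version == "v1" and hmac.compare_digest(expected, sig.encode()):
            return True
    return False


secret = "whsec_" + base64.b64encode(b"demo-signing-key").decode()  # example only
body = b'{"event_type": "public_alert.alert_created_v1"}'
header = "v1," + sign(secret, "msg_123", "1706353200", body).decode()

print(verify_standard_webhook(secret, "msg_123", "1706353200", body, header, now=1706353210))
# True
print(verify_standard_webhook(secret, "msg_123", "1706353200", body, header, now=1706356800))
# False
```

The second call fails only because the timestamp is an hour old, which is outside the assumed tolerance.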

### Google Chat Webhook

Handles Google Chat app events for investigation via Google Workspace.

```bash
# Google Chat webhook (JWT authenticated)
curl -X POST "https://orchestrator.example.com/webhooks/google-chat" \
  -H "Authorization: Bearer <google_chat_jwt>" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "MESSAGE",
    "message": {
      "text": "@IncidentFox investigate the 500 errors on checkout",
      "sender": {"displayName": "Alice"}
    },
    "space": {"name": "spaces/AAAA123"}
  }'
```

## SRE Agent Skills

The agent uses a progressive knowledge loading system with 45+ skills. Each skill provides domain-specific capabilities.
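
Progressive loading can be pictured as keeping a short description of every skill available up front and fetching a skill's full instructions only when the agent decides to use it. The sketch below models that idea; the `Skill` and `SkillIndex` classes, the descriptions, and the skill bodies are hypothetical, and the actual loading mechanism may work differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str              # short summary, always visible to the agent
    load_body: Callable[[], str]  # full instructions, fetched only on demand


class SkillIndex:
    def __init__(self, skills: list[Skill]) -> None:
        self._skills = {skill.name: skill for skill in skills}
        self._loaded: dict[str, str] = {}

    def catalog(self) -> str:
        """Cheap listing that can be placed in the prompt up front."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self._skills.values())

    def load(self, name: str) -> str:
        """Fetch a skill's full body on first use and cache it."""
        if name not in self._loaded:
            self._loaded[name] = self._skills[name].load_body()
        return self._loaded[name]


load_calls: list[str] = []


def make_loader(name: str, body: str) -> Callable[[], str]:
    def loader() -> str:
        load_calls.append(name)  # record each real fetch
        return body
    return loader


index = SkillIndex([
    Skill("infrastructure-kubernetes", "Debug pods, deployments, and resource issues",
          make_loader("infrastructure-kubernetes", "List clusters first; check events before logs.")),
    Skill("observability-datadog", "Query logs and metrics, statistics first",
          make_loader("observability-datadog", "Always start with statistics.")),
])

print(index.catalog())
index.load("infrastructure-kubernetes")
index.load("infrastructure-kubernetes")  # second call is served from the cache
print(load_calls)
# ['infrastructure-kubernetes']
```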

### Kubernetes Debugging Skill

Debug Kubernetes pods, deployments, and resource issues.

```bash
# List clusters (MANDATORY first step)
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py

# List pods in a namespace
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py \
  -n production \
  --cluster-id abc123

# Get pod events (check BEFORE logs - explains 80% of issues)
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py \
  payment-7f8b9c6d5-x2k4m \
  -n production \
  --cluster-id abc123

# Get pod logs
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py \
  payment-7f8b9c6d5-x2k4m \
  -n production \
  --tail 100 \
  --cluster-id abc123

# Describe deployment for rollout status
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py \
  payment \
  -n production \
  --cluster-id abc123
```

### Datadog Analysis Skill

Query Datadog logs and metrics with statistics-first methodology.

```bash
# ALWAYS start with statistics
python .claude/skills/observability-datadog/scripts/get_statistics.py \
  --service payment \
  --time-range 60

# Sample errors strategically
python .claude/skills/observability-datadog/scripts/sample_logs.py \
  --strategy errors_only \
  --service payment \
  --limit 20

# Investigate around a specific timestamp
python .claude/skills/observability-datadog/scripts/sample_logs.py \
  --strategy around_time \
  --timestamp "2025-01-27T05:00:00Z" \
  --window 5
```
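
The statistics-first approach means summarizing the whole log set before reading individual lines, then sampling narrowly. The sketch below illustrates the idea on in-memory records. It is not the code behind `get_statistics.py` or `sample_logs.py`: the record fields, the function names, and the interpretation of the window as minutes are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def get_statistics(logs: list[dict]) -> dict:
    """Summarize volume, error rate, and most frequent error messages."""
    by_status = Counter(log["status"] for log in logs)
    top_errors = Counter(log["message"] for log in logs if log["status"] == "error")
    total = len(logs)
    return {
        "total": total,
        "error_rate": round(100 * by_status["error"] / total, 1) if total else 0.0,
        "top_errors": top_errors.most_common(3),
    }


def sample_around_time(logs: list[dict], timestamp: str, window_minutes: int) -> list[dict]:
    """Return logs within +/- window_minutes of the given timestamp."""
    center = parse_ts(timestamp)
    delta = timedelta(minutes=window_minutes)
    return [log for log in logs if abs(parse_ts(log["timestamp"]) - center) <= delta]


logs = [
    {"timestamp": "2025-01-27T04:58:00Z", "status": "error", "message": "OOMKilled"},
    {"timestamp": "2025-01-27T05:01:00Z", "status": "error", "message": "OOMKilled"},
    {"timestamp": "2025-01-27T05:02:00Z", "status": "info", "message": "request ok"},
    {"timestamp": "2025-01-27T05:30:00Z", "status": "error", "message": "timeout"},
]

stats = get_statistics(logs)
print(stats["error_rate"], stats["top_errors"])
# 75.0 [('OOMKilled', 2), ('timeout', 1)]
nearby = sample_around_time(logs, "2025-01-27T05:00:00Z", 5)
print(len(nearby))
# 3
```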

### Grafana Dashboard Skill

Query Grafana dashboards and Prometheus metrics.

```bash
# List available datasources
python .claude/skills/observability-grafana/scripts/list_datasources.py

# Search dashboards
python .claude/skills/observability-grafana/scripts/list_dashboards.py \
  --query "kubernetes"

# Query Prometheus metrics via Grafana
python .claude/skills/observability-grafana/scripts/query_prometheus.py \
  --query "rate(http_requests_total{service='payment',status=~'5..'}[5m])" \
  --time-range 60

# Get active alerts
python .claude/skills/observability-grafana/scripts/get_alerts.py
```

## Web UI API

The Next.js web UI provides an admin console and an agent runner.

### Stream Agent Run

Execute an agent investigation with real-time SSE streaming.

```typescript
// POST /api/team/agent/stream
const response = await fetch('/api/team/agent/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Investigate the payment service errors',
    max_turns: 20,
    timeout: 300
  }),
  credentials: 'include'
});

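// Process the SSE stream. A network chunk can end mid-line, so keep a buffer
// and only parse lines that are complete.
if (!response.ok || !response.body) {
  throw new Error(`Agent stream request failed with status ${response.status}`);
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true preserves multi-byte characters split across chunks
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // trailing partial line waits for the next chunk

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const event = JSON.parse(line.slice(6));

      switch (event.type) {
        case 'thinking':
          console.log('Agent thinking:', event.content);
          break;
        case 'tool_use':
          console.log('Using tool:', event.name, event.input);
          break;
        case 'result':
          console.log('Investigation complete:', event.content);
          break;
      }
    }
  }
}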
```

### Team Configuration Management

```typescript
// GET /api/team/config - Get team configuration
const config = await fetch('/api/team/config', {
  credentials: 'include'
}).then(r => r.json());

// PATCH /api/team/config - Update team configuration
await fetch('/api/team/config', {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    integrations: {
      datadog: { enabled: true, site: 'us5.datadoghq.com' }
    }
  }),
  credentials: 'include'
});

// GET /api/team/config/required-fields - Check missing required fields
const required = await fetch('/api/team/config/required-fields', {
  credentials: 'include'
}).then(r => r.json());

console.log(`Configured: ${required.configured}/${required.total_required}`);
console.log('Missing:', required.missing);
```

### Knowledge Base Operations

```typescript
// GET /api/team/knowledge/tree - Get knowledge tree structure
const tree = await fetch('/api/team/knowledge/tree', {
  credentials: 'include'
}).then(r => r.json());

// POST /api/team/knowledge/tree/search - Semantic search
const results = await fetch('/api/team/knowledge/tree/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: 'how to handle OOMKilled errors',
    top_k: 5
  }),
  credentials: 'include'
}).then(r => r.json());

// POST /api/team/knowledge/upload - Upload document
const formData = new FormData();
formData.append('file', pdfFile);
formData.append('tree_name', 'runbooks');

await fetch('/api/team/knowledge/upload', {
  method: 'POST',
  body: formData,
  credentials: 'include'
});
```

## Slack Bot Integration

The Slack Bot handles OAuth, events, and interactive components via the Bolt SDK.

### Socket Mode (Local Development)

```python
# Local development with Socket Mode
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"]
)

@app.event("app_mention")
def handle_mention(event, say, client):
    """Handle @IncidentFox mentions"""
    thread_ts = event.get("thread_ts") or event["ts"]
    channel = event["channel"]
    text = event["text"]

    # Stream investigation to thread
    say(
        text="Starting investigation...",
        channel=channel,
        thread_ts=thread_ts
    )

    # Call SRE Agent and stream results
    # (implementation in investigation_handler.py)

if __name__ == "__main__":
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
```
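
Before an `app_mention` event is passed to the agent, the handler needs the thread to reply in and the prompt text without the bot mention. Slack encodes user mentions in message text as `<@USERID>`. The sketch below shows one way to derive both; the function name `investigation_request` and the returned keys are illustrative, not taken from `investigation_handler.py`.

```python
import re

# Matches Slack user mentions such as <@U0BOT123> or <@U0BOT123|name>.
MENTION = re.compile(r"<@[A-Z0-9]+(?:\|[^>]*)?>")


def investigation_request(event: dict) -> dict:
    """Derive the reply thread and the prompt text from an app_mention event."""
    without_mentions = MENTION.sub(" ", event["text"])
    return {
        "channel": event["channel"],
        # Messages inside a thread carry thread_ts; top-level messages only have ts.
        "thread_ts": event.get("thread_ts") or event["ts"],
        "prompt": " ".join(without_mentions.split()),  # collapse leftover whitespace
    }


event = {
    "channel": "C123ABC",
    "ts": "1706353200.000100",
    "text": "<@U0BOT123> investigate the 500 errors on checkout",
}

request = investigation_request(event)
print(request["prompt"])
# investigate the 500 errors on checkout
print(request["thread_ts"])
# 1706353200.000100
```

Note that this sketch strips every user mention in the text, not only the bot's own.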

### HTTP Mode (Production)

```python
# Production with HTTP mode and multi-app registry
from flask import Flask, request
from slack_bolt import App
from slack_bolt.adapter.flask import SlackRequestHandler
from app_registry import SlackAppRegistry

flask_app = Flask(__name__)
registry = SlackAppRegistry()
registry.load_all()

@flask_app.route("/slack/<slug>/events", methods=["POST"])
def slack_events_for_app(slug):
    """Handle events for a specific white-label app"""
    app_handler = registry.get_handler(slug)
    if not app_handler:
        return {"error": f"Unknown app: {slug}"}, 404
    return app_handler.handle(request)

@flask_app.route("/slack/<slug>/oauth_redirect", methods=["GET"])
def slack_oauth_redirect_for_app(slug):
    """Handle OAuth callback for multi-workspace install"""
    # Exchange code for token, save installation
    # (implementation in app.py)
```

## Local Development

Quick start for a local development environment.

```bash
# Clone and start the full stack
git clone https://github.com/incidentfox/incidentfox && cd incidentfox
cp .env.example .env

# Add your LLM API key to .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

# Start all services (Postgres, config-service, sre-agent, web_ui)
make dev

# Optional: Add Slack integration
# 1. Create Slack app using slack-bot/slack-manifest.json
# 2. Add tokens to .env:
echo "SLACK_BOT_TOKEN=xoxb-..." >> .env
echo "SLACK_APP_TOKEN=xapp-..." >> .env

# Start with Slack bot
make dev-slack
```

## Summary

IncidentFox provides a complete platform for AI-powered incident investigation with three main entry points: Slack (via Slack Bot with Socket Mode or HTTP), Web UI (Next.js dashboard with SSE streaming), and Webhooks (GitHub, PagerDuty, Incident.io, Blameless, FireHydrant, Vercel). The platform uses hierarchical configuration (org -> group -> team) with deep merge semantics, ensuring teams can inherit organizational defaults while customizing their specific settings. All investigations run in isolated gVisor Kubernetes sandboxes with a credential proxy pattern that keeps secrets out of agent code.

Integration patterns include: (1) Direct API calls to Config Service for configuration management, (2) SSE streaming from SRE Agent for real-time investigation progress, (3) Webhook handlers in Orchestrator for automated trigger-based investigations, (4) Skill-based progressive loading for domain-specific capabilities (Kubernetes, Datadog, Grafana, etc.), and (5) Web UI proxy routes that authenticate users and forward to backend services. The platform supports 45+ integrations across observability (Datadog, Grafana, Coralogix, Elasticsearch, Splunk), incident management (PagerDuty, Incident.io, Blameless, FireHydrant), infrastructure (Kubernetes, AWS, GCP, Azure), and code (GitHub, GitLab, Sourcegraph).