IncidentFox
https://github.com/incidentfox/incidentfox
# IncidentFox - AI SRE Platform

IncidentFox is an open-source AI SRE platform that automatically investigates production incidents, correlates alerts, analyzes logs, and finds root causes. The platform connects to your observability stack (Datadog, Grafana, Coralogix, Elasticsearch, etc.), infrastructure (Kubernetes, AWS, GCP, Azure), and code repositories to reason through incidents and provide actionable insights. It integrates with Slack, Microsoft Teams, and Google Chat, enabling teams to investigate incidents without leaving their communication tools.

The platform follows a multi-tenant architecture with hierarchical org/team configuration, sandboxed agent execution using gVisor for security isolation, and a credential proxy pattern that ensures secrets never touch the AI agent. Key components include the SRE Agent (Claude SDK-based AI agent with 45+ skills), Config Service (hierarchical configuration with deep merge), Orchestrator (webhook routing and multi-channel output), Slack Bot (Bolt SDK integration), and Web UI (Next.js admin console and agent runner).

## Config Service API

The Config Service provides hierarchical organization and team configuration with deep merge semantics, token management, audit logging, and RBAC.

### Get Effective Team Configuration

Retrieves the merged configuration for the authenticated team, computed by merging defaults -> org -> team hierarchy.

```bash
# Get effective configuration for the current team
curl -X GET "https://config-service.example.com/api/v1/config/me" \
  -H "Authorization: Bearer <team_token>" \
  -H "Content-Type: application/json"

# Response:
{
  "node_id": "team-abc123",
  "effective_config": {
    "entrance_agent": "planner",
    "agents": {
      "planner": {
        "enabled": true,
        "model": "claude-sonnet-4-20250514",
        "system_prompt": "You are an SRE AI assistant..."
      }
    },
    "integrations": {
      "datadog": {"enabled": true, "site": "us5.datadoghq.com"},
      "github": {"enabled": true, "default_org": "mycompany"}
    }
  },
  "computed_at": "2025-01-27T10:30:00Z",
  "hierarchy": ["org-root", "group-sre", "team-abc123"]
}
```

### Update Team Configuration

Updates the team's configuration using deep merge. Dicts merge recursively, lists replace entirely.

```bash
# Update team configuration (deep merge)
curl -X PATCH "https://config-service.example.com/api/v1/config/me" \
  -H "Authorization: Bearer <team_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "integrations": {
        "slack": {
          "auto_investigate_channels": ["#incidents", "#alerts"]
        }
      },
      "agents": {
        "planner": {
          "max_turns": 50
        }
      }
    },
    "reason": "Enable auto-investigation in incident channels"
  }'

# Response:
{
  "node_id": "team-abc123",
  "node_type": "team",
  "config": {
    "integrations": {
      "slack": {
        "auto_investigate_channels": ["#incidents", "#alerts"]
      }
    },
    "agents": {"planner": {"max_turns": 50}}
  },
  "version": 5,
  "updated_at": "2025-01-27T10:35:00Z",
  "updated_by": "team"
}
```

### Admin: Get Node Configuration Hierarchy

Admin endpoint to retrieve raw configuration for any node in the hierarchy.

```bash
# Get raw config for a specific node (admin only)
curl -X GET "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/raw" \
  -H "Authorization: Bearer <admin_token>"

# Get effective (merged) config for a node
curl -X GET "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/effective" \
  -H "Authorization: Bearer <admin_token>"

# Validate node configuration
curl -X POST "https://config-service.example.com/api/v1/config/orgs/org-123/nodes/team-abc/validate" \
  -H "Authorization: Bearer <admin_token>"

# Response:
{
  "node_id": "team-abc",
  "valid": true,
  "missing_required": [],
  "errors": []
}
```

## SRE Agent Investigation API

The SRE Agent server manages AI investigations in isolated Kubernetes sandboxes with SSE streaming.
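Clients of this API read the investigation as a stream of `data:` lines, each carrying one JSON event. The sketch below shows one way to parse such lines in Python. It assumes the event fields (`type`, `content`, `name`) shown in this section's examples; the helper names `iter_sse_events` and `final_result` are hypothetical and not part of IncidentFox.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON object carried by each `data:` line of an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore payloads that are not valid JSON
        if isinstance(event, dict):
            yield event


def final_result(lines: Iterable[str]) -> Optional[str]:
    """Return the content of the last `result` event, or None if there is none."""
    result = None
    for event in iter_sse_events(lines):
        if event.get("type") == "result":
            result = event.get("content")
    return result


# Sample lines shaped like the stream documented in this section (assumed format).
sample = [
    'data: {"type": "thinking", "content": "Starting investigation..."}',
    "",
    'data: {"type": "tool_use", "name": "datadog_get_statistics", "input": {"service": "payment"}}',
    "",
    'data: {"type": "result", "content": "## Root Cause Analysis"}',
]

events = list(iter_sse_events(sample))
print([event["type"] for event in events])
print(final_result(sample))
```

In a real client the `lines` iterable would come from an HTTP library's streaming line iterator rather than a list; the parsing logic stays the same.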
### Run Investigation

Starts an AI investigation session with SSE streaming for real-time progress updates.

```bash
# Start a new investigation (streaming SSE response)
curl -X POST "https://sre-agent.example.com/investigate" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Investigate the high error rate on the payment service",
    "tenant_id": "slack-T123ABC",
    "team_id": "team-payments",
    "team_token": "<team_bearer_token>"
  }'

# SSE Response stream:
data: {"type": "thinking", "content": "Starting investigation of payment service errors..."}
data: {"type": "tool_use", "name": "datadog_get_statistics", "input": {"service": "payment"}}
data: {"type": "tool_result", "name": "datadog_get_statistics", "output": {"error_rate": 15.2, "top_errors": [...]}}
data: {"type": "thinking", "content": "Error rate is elevated at 15.2%. Checking Kubernetes pods..."}
data: {"type": "result", "content": "## Root Cause Analysis\n\nThe payment service is experiencing OOMKilled events..."}
```

### Follow-up Investigation (Session Continuity)

Continue an investigation in the same sandbox session.

```bash
# Follow-up in existing thread (reuses sandbox)
curl -X POST "https://sre-agent.example.com/investigate" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What are the memory limits on this pod?",
    "thread_id": "thread-abc123",
    "tenant_id": "slack-T123ABC",
    "team_id": "team-payments"
  }'
```

### Interrupt Investigation

Stop a long-running investigation gracefully.

```bash
# Interrupt current execution
curl -X POST "https://sre-agent.example.com/interrupt" \
  -H "Authorization: Bearer <service_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "thread_id": "thread-abc123"
  }'
```

## Orchestrator Webhook API

The Orchestrator routes external webhooks to the appropriate team's AI agent for automated investigation.
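A webhook receiver normally authenticates the sender before routing the event. As an illustration, the sketch below checks a GitHub-style `X-Hub-Signature-256` header, which GitHub documents as `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret. Whether the Orchestrator verifies signatures exactly this way is an assumption, and the function names are hypothetical. Other providers such as PagerDuty and Incident.io use their own schemes.

```python
import hashlib
import hmac


def sign_github_payload(secret: bytes, body: bytes) -> str:
    """Compute the value GitHub would send in X-Hub-Signature-256."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_github_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Return True only if the header matches the HMAC of the raw body."""
    if not header_value or not header_value.startswith("sha256="):
        return False
    expected = sign_github_payload(secret, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header_value.encode())


# Hypothetical secret and payload, for illustration only.
secret = b"example-webhook-secret"
body = b'{"action": "opened", "pull_request": {"number": 42}}'

good_header = sign_github_payload(secret, body)
print(verify_github_signature(secret, body, good_header))
print(verify_github_signature(secret, body + b" ", good_header))
```

The signature must be computed over the raw bytes of the request body, before any JSON parsing or re-serialization, because even a whitespace change alters the digest.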
### GitHub Webhook

Handles GitHub push, pull_request, issues, and issue_comment events.

```bash
# GitHub webhook (auto-triggered by GitHub)
curl -X POST "https://orchestrator.example.com/webhooks/github" \
  -H "X-GitHub-Event: pull_request" \
  -H "X-GitHub-Delivery: abc123" \
  -H "X-Hub-Signature-256: sha256=<hmac_signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "opened",
    "pull_request": {
      "number": 42,
      "title": "Fix memory leak in payment service",
      "user": {"login": "developer"}
    },
    "repository": {"full_name": "mycompany/payment-service"}
  }'

# Response (async processing):
{"ok": true}
```

### PagerDuty Webhook

Handles PagerDuty incident events with optional alert correlation.

```bash
# PagerDuty webhook (v3 format)
curl -X POST "https://orchestrator.example.com/webhooks/pagerduty" \
  -H "X-PagerDuty-Signature: <signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "event": {
        "event_type": "incident.triggered",
        "data": {
          "id": "P123ABC",
          "title": "High CPU on payment-service",
          "urgency": "high",
          "service": {"id": "PSVC123"}
        }
      }
    }]
  }'
```

### Incident.io Webhook

Handles Incident.io incident lifecycle and public alert events.

```bash
# Incident.io webhook (Standard Webhooks format)
curl -X POST "https://orchestrator.example.com/webhooks/incidentio" \
  -H "webhook-id: msg_123" \
  -H "webhook-timestamp: 1706353200" \
  -H "webhook-signature: v1,<base64_signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "event_type": "public_alert.alert_created_v1",
    "public_alert.alert_created_v1": {
      "alert_source_id": "01KEGMSPPCKFPYHT2ZSNQ7WY3J",
      "title": "Database connection pool exhausted",
      "status": "firing",
      "priority": "high"
    }
  }'
```

### Google Chat Webhook

Handles Google Chat app events for investigation via Google Workspace.
```bash
# Google Chat webhook (JWT authenticated)
curl -X POST "https://orchestrator.example.com/webhooks/google-chat" \
  -H "Authorization: Bearer <google_chat_jwt>" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "MESSAGE",
    "message": {
      "text": "@IncidentFox investigate the 500 errors on checkout",
      "sender": {"displayName": "Alice"}
    },
    "space": {"name": "spaces/AAAA123"}
  }'
```

## SRE Agent Skills

The agent uses a progressive knowledge loading system with 45+ skills. Each skill provides domain-specific capabilities.

### Kubernetes Debugging Skill

Debug Kubernetes pods, deployments, and resource issues.

```bash
# List clusters (MANDATORY first step)
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py

# List pods in a namespace
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py \
  -n production \
  --cluster-id abc123

# Get pod events (check BEFORE logs - explains 80% of issues)
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py \
  payment-7f8b9c6d5-x2k4m \
  -n production \
  --cluster-id abc123

# Get pod logs
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py \
  payment-7f8b9c6d5-x2k4m \
  -n production \
  --tail 100 \
  --cluster-id abc123

# Describe deployment for rollout status
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py \
  payment \
  -n production \
  --cluster-id abc123
```

### Datadog Analysis Skill

Query Datadog logs and metrics with statistics-first methodology.
```bash
# ALWAYS start with statistics
python .claude/skills/observability-datadog/scripts/get_statistics.py \
  --service payment \
  --time-range 60

# Sample errors strategically
python .claude/skills/observability-datadog/scripts/sample_logs.py \
  --strategy errors_only \
  --service payment \
  --limit 20

# Investigate around a specific timestamp
python .claude/skills/observability-datadog/scripts/sample_logs.py \
  --strategy around_time \
  --timestamp "2025-01-27T05:00:00Z" \
  --window 5
```

### Grafana Dashboard Skill

Query Grafana dashboards and Prometheus metrics.

```bash
# List available datasources
python .claude/skills/observability-grafana/scripts/list_datasources.py

# Search dashboards
python .claude/skills/observability-grafana/scripts/list_dashboards.py \
  --query "kubernetes"

# Query Prometheus metrics via Grafana
python .claude/skills/observability-grafana/scripts/query_prometheus.py \
  --query "rate(http_requests_total{service='payment',status=~'5..'}[5m])" \
  --time-range 60

# Get active alerts
python .claude/skills/observability-grafana/scripts/get_alerts.py
```

## Web UI API

The Next.js web UI provides admin console and agent runner functionality.

### Stream Agent Run

Execute an agent investigation with real-time SSE streaming.
```typescript
// POST /api/team/agent/stream
const response = await fetch('/api/team/agent/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Investigate the payment service errors',
    max_turns: 20,
    timeout: 300
  }),
  credentials: 'include'
});

if (!response.body) {
  throw new Error('Response has no body to stream');
}

// Process SSE stream
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // A network chunk can end mid-line, so keep the trailing partial line
  // in the buffer until the rest of it arrives.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const event = JSON.parse(line.slice(6));
      switch (event.type) {
        case 'thinking':
          console.log('Agent thinking:', event.content);
          break;
        case 'tool_use':
          console.log('Using tool:', event.name, event.input);
          break;
        case 'result':
          console.log('Investigation complete:', event.content);
          break;
      }
    }
  }
}
```

### Team Configuration Management

```typescript
// GET /api/team/config - Get team configuration
const config = await fetch('/api/team/config', {
  credentials: 'include'
}).then(r => r.json());

// PATCH /api/team/config - Update team configuration
await fetch('/api/team/config', {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    integrations: {
      datadog: { enabled: true, site: 'us5.datadoghq.com' }
    }
  }),
  credentials: 'include'
});

// GET /api/team/config/required-fields - Check missing required fields
const required = await fetch('/api/team/config/required-fields', {
  credentials: 'include'
}).then(r => r.json());

console.log(`Configured: ${required.configured}/${required.total_required}`);
console.log('Missing:', required.missing);
```

### Knowledge Base Operations

```typescript
// GET /api/team/knowledge/tree - Get knowledge tree structure
const tree = await fetch('/api/team/knowledge/tree', {
  credentials: 'include'
}).then(r => r.json());

// POST /api/team/knowledge/tree/search - Semantic search
const results = await fetch('/api/team/knowledge/tree/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: 'how to handle OOMKilled errors',
    top_k: 5
  }),
  credentials: 'include'
}).then(r => r.json());

// POST /api/team/knowledge/upload - Upload document
const formData = new FormData();
formData.append('file', pdfFile);
formData.append('tree_name', 'runbooks');

await fetch('/api/team/knowledge/upload', {
  method: 'POST',
  body: formData,
  credentials: 'include'
});
```

## Slack Bot Integration

The Slack Bot handles OAuth, events, and interactive components via Bolt SDK.

### Socket Mode (Local Development)

```python
# Local development with Socket Mode
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"]
)

@app.event("app_mention")
def handle_mention(event, say, client):
    """Handle @IncidentFox mentions"""
    # Reply in the existing thread, or start one on the mention itself
    thread_ts = event.get("thread_ts") or event["ts"]
    channel = event["channel"]
    text = event["text"]

    # Stream investigation to thread
    say(
        text="Starting investigation...",
        channel=channel,
        thread_ts=thread_ts
    )

    # Call SRE Agent and stream results
    # (implementation in investigation_handler.py)

if __name__ == "__main__":
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
```

### HTTP Mode (Production)

```python
# Production with HTTP mode and multi-app registry
from flask import Flask, request
from slack_bolt import App
from slack_bolt.adapter.flask import SlackRequestHandler

from app_registry import SlackAppRegistry

flask_app = Flask(__name__)
registry = SlackAppRegistry()
registry.load_all()

@flask_app.route("/slack/<slug>/events", methods=["POST"])
def slack_events_for_app(slug):
    """Handle events for a specific white-label app"""
    app_handler = registry.get_handler(slug)
    if not app_handler:
        return {"error": f"Unknown app: {slug}"}, 404
    return app_handler.handle(request)

@flask_app.route("/slack/<slug>/oauth_redirect", methods=["GET"])
def slack_oauth_redirect_for_app(slug):
    """Handle OAuth callback for multi-workspace install"""
    # Exchange code for token, save installation
    # (implementation in app.py)
```

## Local Development

Quick start for local development environment.

```bash
# Clone and start the full stack
git clone https://github.com/incidentfox/incidentfox && cd incidentfox
cp .env.example .env

# Add your LLM API key to .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

# Start all services (Postgres, config-service, sre-agent, web_ui)
make dev

# Optional: Add Slack integration
# 1. Create Slack app using slack-bot/slack-manifest.json
# 2. Add tokens to .env:
echo "SLACK_BOT_TOKEN=xoxb-..." >> .env
echo "SLACK_APP_TOKEN=xapp-..." >> .env

# Start with Slack bot
make dev-slack
```

## Summary

IncidentFox provides a complete platform for AI-powered incident investigation with three main entry points: Slack (via Slack Bot with Socket Mode or HTTP), Web UI (Next.js dashboard with SSE streaming), and Webhooks (GitHub, PagerDuty, Incident.io, Blameless, FireHydrant, Vercel). The platform uses hierarchical configuration (org -> group -> team) with deep merge semantics, ensuring teams can inherit organizational defaults while customizing their specific settings. All investigations run in isolated gVisor Kubernetes sandboxes with a credential proxy pattern that keeps secrets out of agent code.

Integration patterns include: (1) Direct API calls to Config Service for configuration management, (2) SSE streaming from SRE Agent for real-time investigation progress, (3) Webhook handlers in Orchestrator for automated trigger-based investigations, (4) Skill-based progressive loading for domain-specific capabilities (Kubernetes, Datadog, Grafana, etc.), and (5) Web UI proxy routes that authenticate users and forward to backend services.
The platform supports 45+ integrations across observability (Datadog, Grafana, Coralogix, Elasticsearch, Splunk), incident management (PagerDuty, Incident.io, Blameless, FireHydrant), infrastructure (Kubernetes, AWS, GCP, Azure), and code (GitHub, GitLab, Sourcegraph).
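The deep merge rule stated for the Config Service (dicts merge recursively, lists replace entirely) can be illustrated with a short Python sketch. This is not the Config Service's actual implementation; `deep_merge`, `effective_config`, and the sample layers are hypothetical and only demonstrate the stated semantics across a defaults -> org -> team hierarchy.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base: nested dicts merge, everything else replaces."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the more specific layer win outright.
            merged[key] = deepcopy(value)
    return merged


def effective_config(*layers: dict) -> dict:
    """Fold layers from least specific to most specific."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers, loosely modeled on the examples in this document.
defaults = {"agents": {"planner": {"enabled": True, "max_turns": 20}}}
org = {"integrations": {"slack": {"auto_investigate_channels": ["#alerts"]}}}
team = {
    "agents": {"planner": {"max_turns": 50}},
    "integrations": {"slack": {"auto_investigate_channels": ["#incidents"]}},
}

config = effective_config(defaults, org, team)
print(config["agents"]["planner"])
print(config["integrations"]["slack"]["auto_investigate_channels"])
```

Here the team layer overrides `max_turns` while inheriting `enabled` from the defaults, and the team's channel list replaces the org's list rather than being appended to it.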