# Bytebot

Bytebot is an open-source AI desktop agent that provides a complete virtual computer environment for automating tasks. It runs in Docker containers on your own infrastructure, giving you an AI assistant that can control a full Ubuntu Linux desktop with pre-installed applications, including browsers, email clients, office tools, and development environments. The agent understands natural language instructions and executes them by controlling the mouse, keyboard, and screen, just like a human would.

The platform consists of four integrated components: a virtual desktop (Ubuntu 22.04 with XFCE4), an AI agent (NestJS service supporting Claude, GPT, and Gemini), a task interface (Next.js web app), and REST APIs for programmatic control. Bytebot excels at enterprise automation (RPA replacement), document processing, multi-system integrations, and development/QA workflows. It handles authentication automatically via password manager extensions and can load uploaded files, including PDFs, directly into the LLM context.

## Quick Start with Docker Compose

Deploy Bytebot with Docker Compose for a complete self-hosted AI desktop automation system.

```bash
# Clone and configure
git clone https://github.com/bytebot-ai/bytebot.git
cd bytebot

# Configure your AI provider (choose one)
echo "ANTHROPIC_API_KEY=sk-ant-your-key-here" > docker/.env
# Or: echo "OPENAI_API_KEY=sk-your-key-here" > docker/.env
# Or: echo "GEMINI_API_KEY=your-key-here" > docker/.env

# Start the agent stack
docker-compose -f docker/docker-compose.yml up -d

# Access the UI at http://localhost:9992
# Agent API at http://localhost:9991
# Desktop API at http://localhost:9990
```

## Tasks API - Create Task

Create a new task for the AI agent to process. Tasks can include natural language descriptions and optional file uploads for document processing.

```bash
# Create a simple task
curl -X POST http://localhost:9991/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Search for flights from NYC to London next month and create a comparison document",
    "priority": "HIGH"
  }'

# Response:
# {
#   "id": "task-123",
#   "description": "Search for flights from NYC to London...",
#   "status": "PENDING",
#   "priority": "HIGH",
#   "createdAt": "2025-04-14T12:00:00Z",
#   "updatedAt": "2025-04-14T12:00:00Z"
# }

# Create task with file upload (multipart/form-data)
curl -X POST http://localhost:9991/tasks \
  -F "description=Analyze the uploaded contracts and extract all payment terms and deadlines" \
  -F "priority=HIGH" \
  -F "files=@contract1.pdf" \
  -F "files=@contract2.pdf"
```
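
The same JSON request can be prepared from Python. The following is a minimal sketch using only the standard library, not an official client: `build_task_payload` and `create_task` are hypothetical helper names, and the empty-description check is a client-side assumption rather than documented server behavior. The priority values are the ones listed in this document.

```python
import json
import urllib.request

AGENT_URL = "http://localhost:9991"  # Agent API port from the Quick Start
PRIORITIES = {"LOW", "MEDIUM", "HIGH", "URGENT"}


def build_task_payload(description, priority="MEDIUM"):
    """Validate inputs and return the JSON body for POST /tasks."""
    description = description.strip()
    if not description:
        # Assumption: rejecting empty descriptions locally is a convenience,
        # not a documented API rule.
        raise ValueError("description must not be empty")
    priority = priority.upper()
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return {"description": description, "priority": priority}


def create_task(description, priority="MEDIUM", base_url=AGENT_URL):
    """Send the payload to the Agent API and return the parsed response."""
    body = json.dumps(build_task_payload(description, priority)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping validation in `build_task_payload` separate from the network call makes it possible to check a payload before anything is sent to the agent.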

## Tasks API - Get Tasks

Retrieve all tasks or get a specific task by ID, including its message history.

```bash
# Get all tasks
curl -X GET http://localhost:9991/tasks

# Response:
# [
#   {
#     "id": "task-123",
#     "description": "Download invoices from webmail",
#     "status": "COMPLETED",
#     "priority": "MEDIUM",
#     "createdAt": "2025-04-14T12:00:00Z",
#     "updatedAt": "2025-04-14T12:30:00Z"
#   },
#   ...
# ]

# Get specific task with messages
curl -X GET http://localhost:9991/tasks/task-123

# Get currently in-progress task
curl -X GET http://localhost:9991/tasks/in-progress
```

## Tasks API - Update and Delete Tasks

Update task status/priority or delete tasks from the system.

```bash
# Update task status and priority
curl -X PATCH http://localhost:9991/tasks/task-123 \
  -H "Content-Type: application/json" \
  -d '{
    "status": "COMPLETED",
    "priority": "HIGH"
  }'

# Delete a task (returns 204 No Content)
curl -X DELETE http://localhost:9991/tasks/task-123

# Task statuses: PENDING, IN_PROGRESS, NEEDS_HELP, NEEDS_REVIEW, COMPLETED, CANCELLED, FAILED
# Priority levels: LOW, MEDIUM, HIGH, URGENT
```
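
A script that creates a task usually needs to wait for it to finish. The sketch below polls a task until its status stops changing on its own. It assumes that `COMPLETED`, `CANCELLED`, and `FAILED` are final and that `NEEDS_HELP` and `NEEDS_REVIEW` pause the task until a person intervenes; the source lists the statuses but does not state these semantics. `wait_for_task` and its `fetch_task` parameter are hypothetical names; `fetch_task` can be any callable that returns the task as a dictionary, such as a wrapper around `GET /tasks/{id}`.

```python
import time

# Assumption: the agent does no further work once a task reaches these.
TERMINAL_STATUSES = {"COMPLETED", "CANCELLED", "FAILED"}
# Assumption: the task is paused until a person steps in.
ATTENTION_STATUSES = {"NEEDS_HELP", "NEEDS_REVIEW"}


def wait_for_task(
    fetch_task,
    task_id,
    interval=5.0,
    timeout=600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll fetch_task(task_id) until the task stops or needs attention.

    Raises TimeoutError if the task is still running after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        task = fetch_task(task_id)
        status = task["status"]
        if status in TERMINAL_STATUSES or status in ATTENTION_STATUSES:
            return task
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.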

## Computer Use API - Screenshot

Capture a screenshot of the virtual desktop. Returns a base64-encoded PNG image.

```bash
# Take a screenshot
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{"action": "screenshot"}'

# Response:
# {
#   "success": true,
#   "data": {
#     "image": "iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ..."
#   }
# }

# Save screenshot to file (bash)
response=$(curl -s -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{"action": "screenshot"}')
# Quote the variable so the shell does not word-split or glob the JSON
printf '%s' "$response" | jq -r '.data.image' | base64 -d > screenshot.png
```
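
Before writing a screenshot to disk, it can be useful to confirm that the decoded data really is a PNG and to read its dimensions. The sketch below does this with the standard library, assuming the response shape shown above (`success` plus `data.image`); `decode_screenshot` is a hypothetical helper name. The width and height come from the PNG IHDR chunk, which always follows the 8-byte file signature.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(response):
    """Return (png_bytes, width, height) from a screenshot response dict."""
    if not response.get("success"):
        raise RuntimeError("screenshot action did not report success")
    png = base64.b64decode(response["data"]["image"])
    if len(png) < 24 or not png.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    # ("IHDR"), then big-endian width and height at offsets 16 and 20.
    width, height = struct.unpack(">II", png[16:24])
    return png, width, height
```

Knowing the screenshot size also helps when choosing click coordinates, since they must fall inside the desktop resolution.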

## Computer Use API - Mouse Actions

Control mouse movements, clicks, and drags on the virtual desktop.

```bash
# Move mouse to coordinates
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "move_mouse",
    "coordinates": {"x": 500, "y": 300}
  }'

# Click at coordinates (left, right, or middle button)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "click_mouse",
    "coordinates": {"x": 500, "y": 300},
    "button": "left",
    "clickCount": 1
  }'

# Double-click with modifier keys
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "click_mouse",
    "coordinates": {"x": 500, "y": 300},
    "button": "left",
    "clickCount": 2,
    "holdKeys": ["ctrl"]
  }'

# Drag from one point to another
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "drag_mouse",
    "path": [
      {"x": 100, "y": 100},
      {"x": 300, "y": 300}
    ],
    "button": "left"
  }'

# Scroll down 5 steps
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "scroll",
    "direction": "down",
    "scrollCount": 5
  }'

# Get current cursor position
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{"action": "cursor_position"}'
# Response: {"success": true, "data": {"x": 500, "y": 300}}
```
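
The `path` field of `drag_mouse` takes a list of points, so a drag can include intermediate waypoints rather than only a start and an end. The sketch below generates evenly spaced waypoints along a straight line. `interpolate_path` and `drag_payload` are hypothetical helpers, and whether Bytebot behaves differently with extra waypoints is an assumption that has not been verified here.

```python
def interpolate_path(start, end, steps=10):
    """Return steps + 1 evenly spaced points from start to end, inclusive."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    return [
        {
            "x": round(x0 + (x1 - x0) * i / steps),
            "y": round(y0 + (y1 - y0) * i / steps),
        }
        for i in range(steps + 1)
    ]


def drag_payload(start, end, steps=10, button="left"):
    """Build a drag_mouse request body with intermediate waypoints."""
    return {
        "action": "drag_mouse",
        "path": interpolate_path(start, end, steps),
        "button": button,
    }
```

With `steps=1` the result matches the two-point drag in the curl example above.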

## Computer Use API - Keyboard Actions

Type text, press keys, and execute keyboard shortcuts.

```bash
# Type text with optional delay between characters
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "type_text",
    "text": "Hello, Bytebot!",
    "delay": 50
  }'

# Paste text (useful for special characters and emojis)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "paste_text",
    "text": "Special characters: ©®™ and emojis"
  }'

# Type individual keys in sequence
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "type_keys",
    "keys": ["a", "b", "c", "enter"],
    "delay": 50
  }'

# Press keyboard shortcut (Ctrl+S to save)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "press_keys",
    "keys": ["ctrl", "s"],
    "press": "down"
  }'

# Wait for specified duration (milliseconds)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "wait",
    "duration": 2000
  }'
```
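
Scripts that chain several keyboard actions often repeat the same request bodies. The sketch below builds them from a shortcut string and inserts `wait` actions between steps. All three function names are hypothetical, and the lowercase key names are an assumption based on the examples above (`ctrl`, `s`, `enter`); the full set of key names the API accepts is not listed in this document.

```python
def parse_shortcut(shortcut):
    """Split a string such as "Ctrl+S" into lowercase key names."""
    keys = [part.strip().lower() for part in shortcut.split("+")]
    if not all(keys):
        raise ValueError(f"malformed shortcut: {shortcut!r}")
    return keys


def press_shortcut_payload(shortcut):
    """Build a press_keys request body mirroring the Ctrl+S example."""
    return {
        "action": "press_keys",
        "keys": parse_shortcut(shortcut),
        "press": "down",
    }


def with_waits(actions, duration_ms=500):
    """Insert a wait action between consecutive request bodies."""
    sequence = []
    for index, action in enumerate(actions):
        if index > 0:
            sequence.append({"action": "wait", "duration": duration_ms})
        sequence.append(action)
    return sequence
```

Each dictionary in the returned sequence can be posted to `/computer-use` in order.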

## Computer Use API - Application Switching

Switch between applications in the virtual desktop environment.

```bash
# Switch to Firefox browser
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "application",
    "application": "firefox"
  }'

# Available applications:
# - firefox: Mozilla Firefox browser
# - 1password: Password manager
# - thunderbird: Email client
# - vscode: Visual Studio Code
# - terminal: Terminal/console
# - desktop: Switch to desktop
# - directory: File manager
```

## Computer Use API - File Operations

Read and write files in the virtual desktop filesystem.

```bash
# Write a file (content must be base64 encoded)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "write_file",
    "path": "/home/user/documents/example.txt",
    "data": "SGVsbG8gV29ybGQh"
  }'

# Response:
# {
#   "success": true,
#   "message": "File written successfully to: /home/user/documents/example.txt"
# }

# Read a file (returns base64 encoded content)
curl -X POST http://localhost:9990/computer-use \
  -H "Content-Type: application/json" \
  -d '{
    "action": "read_file",
    "path": "/home/user/documents/example.txt"
  }'

# Response:
# {
#   "success": true,
#   "data": "SGVsbG8gV29ybGQh",
#   "name": "example.txt",
#   "size": 12,
#   "mediaType": "text/plain"
# }
```
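
Because both file actions carry content as base64, a client has to encode before writing and decode after reading. The sketch below shows that round trip with the standard library. `write_file_payload` and `read_file_bytes` are hypothetical helper names, and treating `size` as the decoded byte length is an assumption, although it is consistent with the example response above (12 bytes for `Hello World!`).

```python
import base64


def write_file_payload(path, content):
    """Build a write_file request body; str content is encoded as UTF-8."""
    raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
    return {
        "action": "write_file",
        "path": path,
        "data": base64.b64encode(raw).decode("ascii"),
    }


def read_file_bytes(response):
    """Decode the bytes carried by a read_file response."""
    if not response.get("success"):
        raise RuntimeError("read_file did not report success")
    raw = base64.b64decode(response["data"])
    # Assumption: "size" is the decoded length in bytes.
    if "size" in response and response["size"] != len(raw):
        raise ValueError("decoded length does not match reported size")
    return raw
```

For example, `write_file_payload("/home/user/documents/example.txt", "Hello World!")` produces the same `data` value as the curl request above.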

## Python SDK Example

Complete Python example for automating browser tasks with the Computer Use API.

```python
import requests
import base64
import time

class BytebotClient:
    def __init__(self, base_url="http://localhost:9990"):
        self.base_url = base_url

    def computer_action(self, action, **params):
        """Execute a computer action on the virtual desktop."""
        url = f"{self.base_url}/computer-use"
        data = {"action": action, **params}
        response = requests.post(url, json=data, timeout=30)
        response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
        return response.json()

    def screenshot(self):
        """Take a screenshot and return the image data."""
        result = self.computer_action("screenshot")
        if result["success"]:
            return base64.b64decode(result["data"]["image"])
        return None

    def click(self, x, y, button="left", count=1):
        """Click at specified coordinates."""
        return self.computer_action("click_mouse",
            coordinates={"x": x, "y": y},
            button=button,
            clickCount=count)

    def type_text(self, text, delay=0):
        """Type text into the active window."""
        return self.computer_action("type_text", text=text, delay=delay)

    def press_keys(self, keys):
        """Press keyboard keys."""
        return self.computer_action("press_keys", keys=keys, press="down")

    def wait(self, ms):
        """Wait for specified milliseconds."""
        return self.computer_action("wait", duration=ms)

    def switch_app(self, app):
        """Switch to specified application."""
        return self.computer_action("application", application=app)

# Usage example: open a web page in Firefox and capture it
client = BytebotClient()

# Open Firefox
client.switch_app("firefox")
client.wait(2000)

# Click on the URL bar and load the Google home page
client.click(500, 50)
client.type_text("https://www.google.com", delay=30)
client.press_keys(["enter"])
client.wait(3000)

# Take a screenshot of the loaded page
screenshot_data = client.screenshot()
if screenshot_data is None:
    # screenshot() returns None when the API does not report success
    raise SystemExit("Screenshot failed")
with open("google_home.png", "wb") as f:
    f.write(screenshot_data)

print("Screenshot saved to google_home.png")
```

## JavaScript/Node.js SDK Example

Complete Node.js example for task automation with both APIs.

```javascript
const axios = require('axios');
const fs = require('fs');

class BytebotClient {
  constructor(agentUrl = 'http://localhost:9991', desktopUrl = 'http://localhost:9990') {
    this.agentUrl = agentUrl;
    this.desktopUrl = desktopUrl;
  }

  // Task Management API
  async createTask(description, priority = 'MEDIUM') {
    const response = await axios.post(`${this.agentUrl}/tasks`, {
      description,
      priority
    });
    return response.data;
  }

  async getTasks() {
    const response = await axios.get(`${this.agentUrl}/tasks`);
    return response.data;
  }

  async getTask(taskId) {
    const response = await axios.get(`${this.agentUrl}/tasks/${taskId}`);
    return response.data;
  }

  async getInProgressTask() {
    const response = await axios.get(`${this.agentUrl}/tasks/in-progress`);
    return response.data;
  }

  // Computer Use API
  async computerAction(action, params = {}) {
    const response = await axios.post(`${this.desktopUrl}/computer-use`, {
      action,
      ...params
    });
    return response.data;
  }

  async screenshot() {
    return this.computerAction('screenshot');
  }

  async click(x, y, button = 'left', clickCount = 1) {
    return this.computerAction('click_mouse', {
      coordinates: { x, y },
      button,
      clickCount
    });
  }

  async typeText(text, delay = 0) {
    return this.computerAction('type_text', { text, delay });
  }

  async pressKeys(keys) {
    return this.computerAction('press_keys', { keys, press: 'down' });
  }

  async wait(duration) {
    return this.computerAction('wait', { duration });
  }

  async switchApp(application) {
    return this.computerAction('application', { application });
  }
}

// Usage example
async function main() {
  const client = new BytebotClient();

  // Create a task for the AI agent
  const task = await client.createTask(
    'Research the top 5 project management tools and create a comparison document',
    'HIGH'
  );
  console.log('Created task:', task.id);

  // Or use direct desktop control
  await client.switchApp('firefox');
  await client.wait(2000);
  await client.click(500, 50);
  await client.typeText('https://example.com');
  await client.pressKeys(['enter']);
  await client.wait(3000);

  // Take and save screenshot
  const result = await client.screenshot();
  if (result.success) {
    const imageBuffer = Buffer.from(result.data.image, 'base64');
    fs.writeFileSync('screenshot.png', imageBuffer);
    console.log('Screenshot saved');
  }
}

main().catch(console.error);
```

## MCP (Model Context Protocol) Integration

Connect MCP clients to access desktop control tools via Server-Sent Events.

The MCP endpoint for SSE connections is `http://localhost:9990/mcp`. For example, to configure Claude Desktop to use the Bytebot MCP server, add the following to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "bytebot": {
      "url": "http://localhost:9990/mcp"
    }
  }
}
```

The MCP endpoint exposes all computer-use actions as tools, including:

- `screenshot`: Capture the desktop
- `click_mouse`: Click at coordinates
- `type_text`: Type text
- `press_keys`: Keyboard shortcuts
- `scroll`: Scroll the page
- `application`: Switch apps
- `read_file` / `write_file`: File operations

## Helm Deployment for Kubernetes

Deploy Bytebot on Kubernetes using Helm charts for production environments.

```bash
# Clone the repository
git clone https://github.com/bytebot-ai/bytebot.git
cd bytebot

# Install with Helm (basic)
helm install bytebot ./helm \
  --set agent.env.ANTHROPIC_API_KEY=sk-ant-your-key-here

# Install with custom values
helm install bytebot ./helm \
  --set agent.env.ANTHROPIC_API_KEY=sk-ant-your-key-here \
  --set agent.env.ANTHROPIC_MODEL=claude-3-5-sonnet-20241022 \
  --set bytebot-ui.ingress.enabled=true \
  --set 'bytebot-ui.ingress.hosts[0].host=bytebot.example.com'  # quoted so the shell does not glob the brackets

# Using values file
cat > my-values.yaml << EOF
agent:
  env:
    ANTHROPIC_API_KEY: sk-ant-your-key-here
bytebot-ui:
  ingress:
    enabled: true
    hosts:
      - host: bytebot.example.com
        paths:
          - path: /
            pathType: Prefix
EOF

helm install bytebot ./helm -f my-values.yaml
```

## LiteLLM Proxy Integration

Use LiteLLM proxy to access multiple LLM providers including Azure OpenAI, AWS Bedrock, and local models.

Start the stack with the LiteLLM proxy compose file by running `docker-compose -f docker/docker-compose.proxy.yml up -d`. An example `litellm_config.yaml`:

```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-deployment
      api_base: https://your-resource.openai.azure.com
      api_key: your-azure-key
      api_version: "2024-02-15-preview"

  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-sonnet-20240229
      api_key: sk-ant-your-key

  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-1.5-flash
      api_key: your-gemini-key
```

The environment variables for proxy mode are `LITELLM_PROXY_URL=http://litellm:4000` and `LITELLM_MODEL=gpt-4`.

Bytebot serves as a powerful platform for enterprise automation, replacing traditional RPA tools with AI-powered adaptability. Primary use cases include financial operations (bank portal automation, invoice processing, reconciliation), compliance workflows (regulatory document downloads, audit trail generation), multi-system integration (bridging legacy systems without APIs), and development/QA integration (automated testing, visual regression). The platform handles authentication automatically through password manager extensions, supporting 2FA workflows without manual intervention.

Integration patterns typically involve either high-level task creation via the Agent API (port 9991) for autonomous AI-driven workflows, or low-level desktop control via the Computer Use API (port 9990) for precise automation scripts. The Agent API is ideal for complex, adaptive tasks described in natural language, while the Computer Use API provides direct programmatic control for integration with existing automation frameworks. The two APIs can also be combined: for example, create tasks via the Agent API while monitoring desktop state via the Computer Use API. The MCP endpoint enables integration with AI assistants like Claude Desktop, exposing all desktop control capabilities as tools.