### Implement VirtualTask Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

A concrete example of a custom task implementation inheriting from the Task class, demonstrating sample iteration and session interaction.

```python
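import asyncio
from typing import Any, Dict, List

from src.server.task import Session, Task
from src.typings import SampleStatus, TaskOutput, TaskSampleExecutionResult


class VirtualTask(Task):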
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(name="virtual-task", *args, **kwargs)

    def get_indices(self) -> List[Any]:
        return list(range(10))

    async def start_sample(self, index, session: Session):
        print("task start sample")
        for loop_times in range(3):
            await asyncio.sleep(1)
            res = await session.action({"role": "user", "content": "Loop: %d" % loop_times})
            print("TASK", res.content)
        return TaskSampleExecutionResult(status=SampleStatus.COMPLETED, result={"result": "ok"})

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        return {"score": 0.4}
```

--------------------------------

### DBBench Task Configuration Example

Source: https://context7.com/thudm/agentbench/llms.txt

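Example assignment configuration that schedules the DBBench task (`dbbench-std`), which evaluates an LLM's ability to perform SQL operations on MySQL databases through function calling, together with the OS task (`os-std`) for the `gpt-3.5-turbo-0613` agent. The `concurrency` section sets a concurrency value per task and per agent, and `output` sets the results directory.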

```yaml
import: definition.yaml

concurrency:
  task:
    dbbench-std: 5
    os-std: 5
  agent:
    gpt-3.5-turbo-0613: 5

assignments:
  - agent:
      - gpt-3.5-turbo-0613
    task:
      - dbbench-std
      - os-std

output: "outputs/{TIMESTAMP}"
```
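The `assignments` block above lists agents and tasks separately. A minimal sketch of how such entries might expand into concrete runs, assuming each entry pairs every listed agent with every listed task; `expand_assignments` is a hypothetical helper, not part of AgentBench:

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_assignments(assignments: List[Dict[str, List[str]]]) -> List[Tuple[str, str]]:
    """Return every (agent, task) pair implied by the assignment entries."""
    pairs: List[Tuple[str, str]] = []
    for entry in assignments:
        for agent, task in product(entry["agent"], entry["task"]):
            if (agent, task) not in pairs:  # keep first-seen order, drop duplicates
                pairs.append((agent, task))
    return pairs


# Same content as the `assignments` section of the YAML above.
assignments = [
    {"agent": ["gpt-3.5-turbo-0613"], "task": ["dbbench-std", "os-std"]},
]
pairs = expand_assignments(assignments)
print(pairs)
```

With the configuration above this yields two runs, one per task, both using the same agent.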

--------------------------------

### Start and Manage Task Controller API

Source: https://context7.com/thudm/agentbench/llms.txt

Commands to start the Task Controller on default or custom ports, and to monitor and manage task workers and sessions via its API.

```bash
python -m src.server.task_controller
python -m src.server.task_controller -p 3000
curl http://localhost:5000/api/list_workers
curl http://localhost:5000/api/list_sessions
curl -X POST http://localhost:5000/api/sync_all
curl -X POST http://localhost:5000/api/cancel_all
```
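The same endpoints can be called from Python. The sketch below only builds the requests, so it runs without a controller. The endpoint names and HTTP methods come from the `curl` commands above, while `build_request` and `ENDPOINTS` are illustrative names rather than AgentBench APIs:

```python
import urllib.request

DEFAULT_BASE = "http://localhost:5000/api"  # port used in the curl examples

# Endpoint name -> HTTP method, matching the curl commands above.
ENDPOINTS = {
    "list_workers": "GET",
    "list_sessions": "GET",
    "sync_all": "POST",
    "cancel_all": "POST",
}


def build_request(endpoint: str, base: str = DEFAULT_BASE) -> urllib.request.Request:
    """Build (but do not send) a request for one controller endpoint."""
    method = ENDPOINTS[endpoint]
    url = f"{base.rstrip('/')}/{endpoint}"
    return urllib.request.Request(url, method=method)


req = build_request("sync_all")
print(req.get_method(), req.full_url)

# To send it against a running controller:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

If the controller was started with `-p 3000`, pass `base="http://localhost:3000/api"` instead of relying on the default.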

--------------------------------

### YAML Import Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

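Demonstrates the `import` keyword, which pulls configuration from other YAML files into the current one. Both a single file and a list of files are supported, and nested imports are resolved recursively. The two blocks below are equivalent: the first spells the definitions out inline, the second imports them from `def1.yaml` and `def2.yaml`.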

```yaml
# config.yaml
definition:
  def1: something...
  def2: something...
```

```yaml
# def1.yaml
def1: something...

# def2.yaml
def2: something...

# config.yaml
definition:
  import:
    - def1.yaml
    - def2.yaml
```
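The recursive resolution described above can be sketched as follows. This is not AgentBench's loader: the files are simulated with an in-memory dictionary (`FILES`) so the example runs without PyYAML, relative path handling is omitted, and the rule that local keys win over imported ones is an assumption.

```python
from typing import Any, Dict, List, Union

# In-memory stand-in for YAML files on disk (hypothetical, for illustration).
FILES: Dict[str, Dict[str, Any]] = {
    "def1.yaml": {"def1": "something..."},
    "def2.yaml": {"def2": "something..."},
    "config.yaml": {"definition": {"import": ["def1.yaml", "def2.yaml"]}},
}


def resolve_imports(node: Any) -> Any:
    """Recursively replace every `import` key with the content it points to."""
    if isinstance(node, list):
        return [resolve_imports(item) for item in node]
    if not isinstance(node, dict):
        return node
    merged: Dict[str, Any] = {}
    imports: Union[str, List[str]] = node.get("import", [])
    if isinstance(imports, str):  # single-file form: `import: def1.yaml`
        imports = [imports]
    for path in imports:
        merged.update(resolve_imports(FILES[path]))  # imported files may import too
    for key, value in node.items():
        if key != "import":
            merged[key] = resolve_imports(value)  # assumption: local keys take priority
    return merged


config = resolve_imports(FILES["config.yaml"])
print(config)
```

The resolved `config` has the same shape as the inline version of `config.yaml`.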

--------------------------------

### Launch AgentBench Services with Docker Compose

Source: https://github.com/thudm/agentbench/blob/main/README.md

Command to start the AgentBench infrastructure, including the controller, task workers, and supporting services like Redis and Freebase.

```shell
docker compose -f extra/docker-compose.yml up
```

--------------------------------

### Start Task Configuration

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Describes the 'start_task.yaml' file used with 'src.start_task' for automating task worker launches. It includes fields for task definitions, starting specific tasks, and controller address.

```yaml
definition:
  import: "task_assembly.yaml"
start:
  task_name1: 5
  task_name2: 3
controller_address: "http://localhost:5000/api/"
```

--------------------------------

### Setup Docker Images for AgentBench Tasks

Source: https://github.com/thudm/agentbench/blob/main/README.md

Commands to pull or build the necessary Docker images for dbbench and os_interaction tasks. These images are required before launching the full AgentBench stack.

```shell
# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
```

--------------------------------

### Automate Task Worker Launch with start_task

Source: https://context7.com/thudm/agentbench/llms.txt

Scripts to automate the bulk launching of task workers. Supports auto-controller launch, lite configurations for limited RAM, manual task starting, and custom base ports for workers.

```bash
python -m src.start_task -a
python -m src.start_task -a --config configs/start_task_lite.yaml
python -m src.start_task -s dbbench-std 5 os-std 3
python -m src.start_task -a --base-port 6001
```
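To make the `--base-port` option concrete, here is a small sketch of how worker ports could be laid out. It assumes ports are handed out sequentially starting at the base port, which the source implies but does not spell out; `plan_worker_ports` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def plan_worker_ports(start: Dict[str, int], base_port: int) -> List[Tuple[str, int]]:
    """Assign one port per worker, counting upward from base_port (assumed scheme)."""
    plan: List[Tuple[str, int]] = []
    port = base_port
    for task_name, worker_count in start.items():
        for _ in range(worker_count):
            plan.append((task_name, port))
            port += 1
    return plan


# Mirrors `-s dbbench-std 5 os-std 3` combined with `--base-port 6001`.
plan = plan_worker_ports({"dbbench-std": 5, "os-std": 3}, base_port=6001)
for task_name, port in plan:
    print(task_name, port)
```

Under this assumption eight workers occupy ports 6001 through 6008, so that range needs to be free before launching.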

--------------------------------

### Task Worker Start Script

Source: https://context7.com/thudm/agentbench/llms.txt

Automates the bulk launching of task workers based on configuration, connecting them to the controller.

```APIDOC
## Task Worker Start Script

The `start_task` module automates bulk launching of task workers based on configuration. It reads from the configuration file and connects workers to the controller automatically.

### Start Task Workers

**Method:** `python -m src.start_task`

**Parameters:**
- `-a` (flag, Optional): Automatically start the controller and workers.
- `--config` (string, Optional): Path to the configuration file (e.g., `configs/start_task_lite.yaml`).
- `-s` (string, Optional): Manually specify tasks and worker counts (e.g., `dbbench-std 5 os-std 3`).
- `--base-port` (int, Optional): Custom base port for workers (workers use ports starting from this value).

**Examples:**
```bash
# Start task workers with auto-controller (launches controller + workers)
python -m src.start_task -a

# Start with lite preset for limited RAM environments
python -m src.start_task -a --config configs/start_task_lite.yaml

# Start specific tasks manually with custom worker counts
python -m src.start_task -s dbbench-std 5 os-std 3

# Start with custom base port for workers (worker ports start from 6001)
python -m src.start_task -a --base-port 6001
```
```

--------------------------------

### YAML Default Keyword Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Illustrates the `default` keyword, which supplies fallback values for the sibling entries in the same mapping. Defaults are merged into each entry at lower priority, so an entry's own value wins on conflict (see `def3`, which keeps `type: float`). The two blocks below are equivalent.

```yaml
definition:
  def1:
    type: int
    value: 1
  def2:
    type: int
    value: 2
  def3:
    type: float
    value: 1.1
```

```yaml
definition:
  default:
    type: int
  def1:
    value: 1
  def2:
    value: 2
  def3:
    type: float
    value: 1.1
```
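The merge rule can be sketched in a few lines of Python. This is an illustration of the priority order only: it uses a shallow merge, whereas the real loader may treat nested mappings differently, and `apply_default` is a hypothetical name.

```python
from typing import Any, Dict


def apply_default(section: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Merge the `default` entry into every sibling; the sibling's own keys win."""
    default = section.get("default", {})
    return {
        name: {**default, **entry}  # later unpacking overrides earlier keys
        for name, entry in section.items()
        if name != "default"
    }


# Same content as the second YAML block above.
definition = {
    "default": {"type": "int"},
    "def1": {"value": 1},
    "def2": {"value": 2},
    "def3": {"type": "float", "value": 1.1},
}
expanded = apply_default(definition)
print(expanded)
```

The result matches the fully spelled-out first block: `def1` and `def2` inherit `type: int`, while `def3` keeps its own `type: float`.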

--------------------------------

### Start New Test Case

Source: https://github.com/thudm/agentbench/blob/main/docs/Introduction_en.md

Initiates a new test case on the Task Server. This endpoint assigns the task to an available Task Worker and returns a unique session ID for tracking. It also provides the task description or initial prompt.

```APIDOC
## POST /api/start_sample

### Description
Initiates a new test case, assigning it to a Task Worker and returning a `session_id` for future reference. The response includes the task description or initial prompt.

### Method
POST

### Endpoint
/api/start_sample

### Parameters
#### Request Body
- **agent** (string) - Required - The agent to be used for the task.
- **task_id** (string) - Required - The identifier of the task to be started.
- **task_config** (object) - Optional - Configuration for the task.

### Request Example
```json
{
  "agent": "some_agent",
  "task_id": "task_123",
  "task_config": {
    "setting": "value"
  }
}
```

### Response
#### Success Response (200)
- **session_id** (string) - A unique identifier for the test case session.
- **task_description** (string) - The description of the task.
- **prompt** (string) - The initial prompt for the agent.

#### Response Example
```json
{
  "session_id": "sess_abc123",
  "task_description": "Evaluate the agent's ability to summarize text.",
  "prompt": "Please summarize the following document..."
}
```
```

--------------------------------

### YAML Overwrite Keyword Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Explains the 'overwrite' keyword in YAML, which functions similarly to 'default' but gives the 'overwrite' values higher priority in case of conflicts. This is useful for setting mandatory values.

```yaml
agent:
  module: "some.agent.module"
  parameters:
    overwrite:
      api_key: "your_api_key"
    default:
      timeout: 60
```
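The priority order implied by the description is `default` < the entry's own values < `overwrite`. A minimal sketch of that ordering, assuming a shallow merge; `merge_entry` and the sample values are illustrative and do not come from AgentBench:

```python
from typing import Any, Dict


def merge_entry(
    default: Dict[str, Any], entry: Dict[str, Any], overwrite: Dict[str, Any]
) -> Dict[str, Any]:
    """Combine the three layers; later layers win on conflicting keys."""
    return {**default, **entry, **overwrite}


default = {"timeout": 60, "api_key": "fallback-key"}  # lowest priority
entry = {"timeout": 30, "api_key": "entry-key"}       # the entry's own values
overwrite = {"api_key": "your_api_key"}               # highest priority

merged = merge_entry(default, entry, overwrite)
print(merged)
```

Here the entry's `timeout` beats the default, but its `api_key` is replaced by the `overwrite` value, which is why `overwrite` suits values that must not be changed by individual entries.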

--------------------------------

### GET /api/list_sessions

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Retrieves a list of all active task sessions.

```APIDOC
## GET /api/list_sessions

### Description
Returns all active sessions managed by the task controller.

### Method
GET

### Endpoint
/api/list_sessions

### Parameters
None

### Response
#### Success Response (200)
- **sessions** (array) - List of active session objects.
```

--------------------------------

### GET /api/list_workers

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Retrieves a list of all currently registered task workers.

```APIDOC
## GET /api/list_workers

### Description
Returns a list of all task_workers currently registered with the controller.

### Method
GET

### Endpoint
/api/list_workers

### Parameters
None

### Response
#### Success Response (200)
- **workers** (array) - List of active worker objects.
```

--------------------------------

### Bash Script to Count Files Recursively

Source: https://github.com/thudm/agentbench/blob/main/data/os_interaction/scripts/5/prompt.md

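The task asks for a bash script that recursively counts the regular files in a given directory and its subdirectories, installed as an executable in /usr/local/bin. It must handle regular files, directories, and symbolic links. The first block below is a reference `count_files` implementation. The second is the checking script: it compares the output of the installed `count` command against `count_files` for several directories and exits with status 1 on the first mismatch.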

```bash
#!/bin/bash

count_files() {
    local dir=$1
    local count=0

    for file in "$dir"/*;
    do
        if [ -f "$file" ]; then
            count=$((count + 1))
        elif [ -d "$file" ]; then
            count_sub=$(count_files "$file")
            count=$((count + count_sub))
        fi
    done

    echo "$count"
}

directory="$1"
total_count=$(count_files "$directory")
echo "$total_count"
```

```bash
#!/bin/bash

count_files() {
    # echo $1 >> tmp.log
    local dir=$1
    local count=0

    for file in "$dir"/*;
    do
        if [ -f "$file" ]; then
            count=$((count + 1))
        elif [ -d "$file" ]; then
            count_sub=$(count_files "$file")
            count=$((count + count_sub))
        fi
    done

    echo "$count"
}

# echo `count_files "/usr/local/bin"`, `count "/usr/local/bin"`

[ `count_files "/usr/local/bin"`x != `count "/usr/local/bin"`x ] && exit 1
[ `count_files "/root"`x != `count "/root"`x ] && exit 1
[ `count_files "/bin"`x != `count "/bin"`x ] && exit 1
[ `count_files "/lib"`x != `count "/lib"`x ] && exit 1
[ `count_files "/dev"`x != `count "/dev"`x ] && exit 1
[ `count_files "/usr/include"`x != `count "/usr/include"`x ] && exit 1
exit 0
```
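For cross-checking the bash logic outside a container, the same counting rule can be written in Python. This is a sketch rather than part of the task: it mirrors the `"$dir"/*` glob by skipping dot-prefixed names and, like `[ -f ]` and `[ -d ]`, follows symbolic links. Unlike the bash version, it raises an error on unreadable directories instead of counting zero.

```python
import os
import tempfile


def count_files(directory: str) -> int:
    """Count regular files under `directory`, ignoring hidden (dot-prefixed) entries."""
    count = 0
    for name in os.listdir(directory):
        if name.startswith("."):  # the bash glob `*` does not match hidden names
            continue
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # follows symlinks, like `[ -f ]`
            count += 1
        elif os.path.isdir(path):  # follows symlinks, like `[ -d ]`
            count += count_files(path)
    return count


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("x.txt", os.path.join("a", "y.txt"), os.path.join("a", "b", "z.txt")):
        open(os.path.join(root, rel), "w").close()
    demo_count = count_files(root)
    print(demo_count)
```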

--------------------------------

### Configure Modular YAML Settings

Source: https://context7.com/thudm/agentbench/llms.txt

Demonstrates the use of 'import', 'default', and 'overwrite' keywords to manage complex task configurations and inheritance in AgentBench.

```yaml
definition:
  import:
    - tasks/task_assembly.yaml
    - agents/api_agents.yaml

tasks:
  default:
    module: src.server.tasks.BaseTask
    parameters:
      concurrency: 32

  task1:
    parameters:
      name: "task1"

  task2:
    parameters:
      name: "task2"
      concurrency: 16

---
# Separate example: `overwrite` applied on top of imported task definitions
definition:
  task:
    overwrite:
      module: src.client.TaskClient
      parameters:
        controller_address: "http://localhost:5000/api"
    import: ../tasks/task_assembly.yaml
```

--------------------------------

### Configure ALFWorld Task Environment

Source: https://context7.com/thudm/agentbench/llms.txt

Sets up the ALFWorld house-holding environment, specifying data paths, prompt configurations, and agent action tools.

```yaml
default:
  module: src.server.tasks.alfworld.ALFWorld
  parameters:
    name: alfworld-std
    concurrency: 16
    data_path: "/app/data/alfworld"
    config_path: "/app/src/server/tasks/alfworld/configs/base_config.yaml"
    prompts_path: "/app/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
    split: "new_std"
    max_step: 20
    tools:
      - type: "function"
        function:
          name: "take_action"
          description: "Take an action."
          parameters:
            type: "object"
            properties:
              action:
                type: "string"
            required:
              - "action"
```

--------------------------------

### Manage Docker Infrastructure for AgentBench

Source: https://context7.com/thudm/agentbench/llms.txt

Shell commands to pull required images, build task-specific containers, and orchestrate the full stack deployment with scaling options.

```bash
# Build Docker images required for tasks
docker pull mysql:8
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles

# Start full stack with Docker Compose
docker compose -f extra/docker-compose.yml up

# Scale specific task workers
docker compose -f extra/docker-compose.yml up --scale alfworld-std=3
```

--------------------------------

### Database Benchmark Task (DBBench)

Source: https://context7.com/thudm/agentbench/llms.txt

Evaluates LLM capability in performing SQL operations on MySQL databases using a function-calling style prompt system.

```APIDOC
## Database Benchmark Task (DBBench)

The DBBench task evaluates LLM capability in performing SQL operations on MySQL databases. Agents use function calling to execute queries and commit final answers.

### Configuration Example (`configs/tasks/dbbench.yaml`)

```yaml
# Example configuration for DBBench task
# This is a placeholder and actual configuration may vary.
module: src.client.tasks.DBBench
parameters:
  db_name: "agentbench_db"
  db_user: "user"
  db_password: "password"
  db_host: "localhost"
  init_sql: "scripts/dbbench/init.sql"
  # ... other DBBench specific parameters
```
```

--------------------------------

### WebShop Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the WebShop environment for agent evaluation, including search and click action tools.

```APIDOC
## POST /tasks/webshop

### Description
Configures the WebShop task module with concurrency settings and available agent tools.

### Method
POST

### Endpoint
/tasks/webshop

### Parameters
#### Request Body
- **concurrency** (integer) - Required - Number of concurrent tasks
- **round** (integer) - Required - Number of evaluation rounds
- **tools** (array) - Required - List of available function tools (search_action, click_action)

### Request Example
{
  "concurrency": 64,
  "round": 20,
  "tools": [{"name": "search_action"}, {"name": "click_action"}]
}
```

--------------------------------

### Configure WebShop Task Environment

Source: https://context7.com/thudm/agentbench/llms.txt

Defines the WebShop task module, concurrency settings, and available function tools for the agent in YAML format.

```yaml
default:
  module: src.server.tasks.webshop.WebShop
  parameters:
    concurrency: 64
    round: 20
    tools:
      - type: "function"
        function:
          name: "search_action"
          description: "Use search functionality with specified keywords."
          parameters:
            type: "object"
            properties:
              keywords:
                type: "string"
            required:
              - "keywords"
      - type: "function"
        function:
          name: "click_action"
          description: "Click a button or link with a specified value."
          parameters:
            type: "object"
            properties:
              value:
                type: "string"
            required:
              - "value"
```

--------------------------------

### Implement Custom Task Class

Source: https://context7.com/thudm/agentbench/llms.txt

Demonstrates how to inherit from the Task base class to create a custom evaluation task, including sample indexing, execution logic, and metric calculation.

```python
from typing import List, Dict, Any
from src.typings import AgentOutputStatus, SampleIndex, TaskSampleExecutionResult, TaskOutput, SampleStatus
from src.server.task import Task, Session
import asyncio

class VirtualTask(Task):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(name="virtual-task", *args, **kwargs)
        self.data = [
            {"question": "What is 2+2?", "answer": "4"},
            {"question": "Capital of France?", "answer": "Paris"},
        ]

    def get_indices(self) -> List[SampleIndex]:
        return list(range(len(self.data)))

    async def start_sample(self, index: SampleIndex, session: Session) -> TaskSampleExecutionResult:
        sample = self.data[index]
        response = await session.action(
            {"role": "user", "content": f"Question: {sample['question']}"}
        )
        # Compare against the enum rather than a raw string.
        if response.status != AgentOutputStatus.NORMAL:
            return TaskSampleExecutionResult(
                status=SampleStatus.AGENT_CONTEXT_LIMIT,
                result={"error": "Agent context limit reached"}
            )
        is_correct = sample["answer"].lower() in response.content.lower()
        return TaskSampleExecutionResult(
            status=SampleStatus.COMPLETED,
            result={"correct": is_correct, "response": response.content}
        )

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        correct_count = sum(1 for r in results if r.result and r.result.get("correct"))
        return {
            "accuracy": correct_count / len(results) if results else 0,
            "total": len(results),
            "correct": correct_count
        }
```

--------------------------------

### ALFWorld Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the ALFWorld house-holding environment for text-based agent reasoning tasks.

```APIDOC
## POST /tasks/alfworld

### Description
Sets up the ALFWorld environment with specific data paths and prompt configurations.

### Method
POST

### Endpoint
/tasks/alfworld

### Parameters
#### Request Body
- **data_path** (string) - Required - Path to ALFWorld dataset
- **config_path** (string) - Required - Path to base configuration YAML
- **max_step** (integer) - Required - Maximum steps allowed per task

### Request Example
{
  "data_path": "/app/data/alfworld",
  "max_step": 20
}
```

--------------------------------

### Build Docker Images for OS Interaction (Bash)

Source: https://context7.com/thudm/agentbench/llms.txt

These bash commands build the necessary Docker images for the OS Interaction task, including the base Ubuntu image and custom images for different configurations like 'default', 'packages', and 'ubuntu'.

```bash
# Build required Docker images for OS Interaction task
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu
```

--------------------------------

### Define AgentBench Services with Docker Compose

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the controller and task-specific workers like alfworld and dbbench. Uses host networking and volume mounts to facilitate communication between the controller and task environments.

```yaml
name: agentbench-fc

services:
  controller:
    image: jingbh/agentrl-controller:latest
    container_name: agentrl-controller
    network_mode: host
    command:
      - controller

  alfworld-std:
    build:
      context: ..
      dockerfile: src/server/tasks/alfworld/Dockerfile
    command: --controller http://172.17.0.1:5020/api alfworld-std
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - controller

  dbbench-std:
    build:
      context: ..
      dockerfile: src/server/tasks/dbbench/Dockerfile
    command: --controller http://172.17.0.1:5020/api dbbench-std
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DBBENCH_STD_PARAMETERS_ENV_OPTIONS_NETWORK_NAME=agentbench-fc_default
    depends_on:
      - controller

  redis:
    image: redis:7
    container_name: redis
    network_mode: host
```

--------------------------------

### Build Docker Image for DBBench (Bash)

Source: https://context7.com/thudm/agentbench/llms.txt

This bash command pulls the MySQL 8 Docker image, which is required for the DBBench task to set up its database environment.

```bash
# Pull the MySQL image required by DBBench
docker pull mysql:8
```

--------------------------------

### DBBench Task Configuration (YAML)

Source: https://context7.com/thudm/agentbench/llms.txt

Configuration for the DBBench task, which involves executing SQL queries against a database. It defines parameters for concurrency, maximum rounds, and available tools like 'execute_sql' and 'commit_final_answer'. The environment driver is set to Docker with specific network and state options.

```yaml
default:
  module: src.server.tasks.dbbench.DBBenchTask
  parameters:
    concurrency: 32
    max_round: 15
    tools:
      - type: "function"
        function:
          name: "execute_sql"
          description: "Executes a given SQL statement on the database and returns the result."
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "The SQL query to be executed."
            required:
              - "query"
            additionalProperties: false
      - type: "function"
        function:
          name: "commit_final_answer"
          description: "Commits the final answer after all operations are completed."
          parameters:
            type: "object"
            properties:
              answers:
                type: "array"
                items:
                  type: "string"
                description: "The list of final answers to commit."
            required:
              - "answers"
            additionalProperties: false
    env_driver: docker
    env_options:
      network_name: dbbench_default
      state_driver: redis
      state_options:
        connection:
          host: 172.17.0.1
dbbench-std:
  parameters:
    name: dbbench-std
    data_file: "data/dbbench/standard.jsonl"
```
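The `tools` entries follow the function-calling schema style, where `required` and `additionalProperties: false` constrain what an agent may pass. The sketch below checks a call's arguments against such a schema. It is a hand-rolled subset of JSON Schema for illustration only; the source does not describe how AgentBench validates calls, and `validate_arguments` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Parameter schema of the `execute_sql` tool, copied from the YAML above.
EXECUTE_SQL_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
    "additionalProperties": False,
}

# Only the JSON types used by the configuration above are covered here.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def validate_arguments(schema: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    errors: List[str] = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected argument: {name}")
            continue
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is not None and not isinstance(value, expected):
            errors.append(f"argument {name} should be of type {type_name}")
    return errors


print(validate_arguments(EXECUTE_SQL_SCHEMA, {"query": "SELECT 1"}))
print(validate_arguments(EXECUTE_SQL_SCHEMA, {"sql": "SELECT 1"}))
```

The second call reports both a missing `query` and an unexpected `sql`, which is the kind of feedback that can be returned to an agent after a malformed tool call.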

--------------------------------

### Test Agent Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Command to test agent configurations, specifically for API-based agents like GPT-3.5-turbo-0613, using a provided configuration file.

```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
```

--------------------------------

### Handle Agent Sessions and Interactions

Source: https://context7.com/thudm/agentbench/llms.txt

Python implementation for managing agent conversation history and processing responses using the Session interface. Includes status checking for context limits and cancellation.

```python
from src.server.task import Session
from src.typings import AgentOutput, AgentOutputStatus, ChatHistoryItem

async def handle_agent_interaction(session: Session):
    session.inject({"role": "user", "content": "You are a helpful assistant."})

    session.inject([
        {"role": "user", "content": "Hello"},
        {"role": "agent", "content": "Hi! How can I help?"}
    ])

    response: AgentOutput = await session.action(
        {"role": "user", "content": "What is the weather today?"}
    )

    if response.status == AgentOutputStatus.NORMAL:
        print(f"Agent response: {response.content}")
    elif response.status == AgentOutputStatus.AGENT_CONTEXT_LIMIT:
        print("Agent reached context limit")
    elif response.status == AgentOutputStatus.CANCELLED:
        print("Request was cancelled")
        return None

    return response.content
```

--------------------------------

### OS Interaction Task Configuration (YAML)

Source: https://context7.com/thudm/agentbench/llms.txt

Configuration for the OS Interaction task, designed to test LLMs in a Linux environment. It specifies parameters like concurrency and round limit, and defines tools for executing bash scripts ('bash_action'), indicating task completion ('finish_action'), and providing answers ('answer_action').

```yaml
# configs/tasks/os.yaml
default:
  module: "src.server.tasks.os_interaction.OSInteraction"
  parameters:
    concurrency: 32
    round_limit: 8
    tools:
      - type: "function"
        function:
          name: "bash_action"
          description: "Execute bash code to perform an operation in the Linux environment."
          parameters:
            type: "object"
            properties:
              script:
                type: "string"
                description: "The bash script to be executed."
            required:
              - "script"
            additionalProperties: false
      - type: "function"
        function:
          name: "finish_action"
          description: "Indicate that the task has been finished or need some additional information."
          parameters:
            type: "object"
            properties:
              thought:
                type: "string"
                description: "The thought or reason indicating the task is finished."
            required:
              - "thought"
            additionalProperties: false
      - type: "function"
        function:
          name: "answer_action"
          description: "Provide the answer to the question."
          parameters:
            type: "object"
            properties:
              answer:
                type: "string"
                description: "The answer to the question."
            required:
              - "answer"
            additionalProperties: false
    docker_config:
      localhost: local-os
      directory: data/os_interaction/res/dockerfiles
    env_driver: docker
    env_options:
      network_name: os_interaction_default
      state_driver: redis
      state_options:
        connection:
          host: 172.17.0.1
```

--------------------------------

### Task Configuration Structure

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Outlines the structure for task configuration files within the 'tasks' directory. It requires 'module' and 'parameters' and is typically used for defining reusable task components.

```yaml
task_name:
  module: "task_module_path"
  parameters:
    param1: value1
    param2: value2
```
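A `module` plus `parameters` pair is a common way to describe an object to construct. The sketch below shows one plausible way to turn such an entry into an instance with `importlib`, assuming the last dotted component of `module` is the class name, as in `src.server.tasks.dbbench.DBBenchTask`. The `instantiate` helper is hypothetical, and AgentBench's own factory may work differently.

```python
import importlib
from typing import Any, Dict


def instantiate(entry: Dict[str, Any]) -> Any:
    """Import the class named by `module` and call it with `parameters` as keywords."""
    module_path, _, class_name = entry["module"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**entry.get("parameters", {}))


# Stand-in entry pointing at a standard-library class so the sketch runs anywhere.
entry = {
    "module": "fractions.Fraction",
    "parameters": {"numerator": 1, "denominator": 2},
}
obj = instantiate(entry)
print(obj)
```

Because parameters are passed as keyword arguments, every key under `parameters` has to match a constructor argument of the target class.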

--------------------------------

### Agent Configuration (OpenAI Chat API)

Source: https://context7.com/thudm/agentbench/llms.txt

Configure HTTP-based agents to connect to OpenAI or compatible API endpoints, supporting customizable request formatting, authentication, and response parsing.

```APIDOC
## Agent Configuration (OpenAI Chat API)

Configure HTTP-based agents to connect to OpenAI or compatible API endpoints. The HTTPAgent class supports customizable request formatting, authentication headers, and response parsing.

### Example Configuration (`configs/agents/openai-chat.yaml`)

```yaml
module: src.client.agents.HTTPAgent
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer sk-your-api-key-here
  body:
    temperature: 0
    max_tokens: 512
    model: gpt-3.5-turbo-0613
  prompter:
    name: role_content_dict
    args:
      agent_role: assistant
  return_format: "{response[choices][0][message][content]}"
```

### Test Agent Configuration

**Method:** `python -m src.client.agent_test`

**Parameters:**
- `--config` (string): Path to the agent configuration file (e.g., `configs/agents/api_agents.yaml`).
- `--agent` (string): The name of the agent to test (e.g., `gpt-3.5-turbo-0613`).

**Example:**
```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
```
```

--------------------------------

### AgentBench Data Structures

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

Core Pydantic models and Enums used for tracking task execution status, chat history, and agent outputs.

```python
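from typing import List, Union

from pydantic import BaseModel

# SampleStatus, SampleIndex, JSONSerializable, ChatHistoryItem and
# AgentOutputStatus are AgentBench types that are not shown here.


class TaskSampleExecutionResult(BaseModel):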
    status: SampleStatus = SampleStatus.COMPLETED
    result: JSONSerializable = None

class TaskOutput(BaseModel):
    index: Union[None, SampleIndex] = None
    status: SampleStatus = SampleStatus.RUNNING
    result: JSONSerializable = None
    history: Union[None, List[ChatHistoryItem]] = None

class AgentOutput(BaseModel):
    status: AgentOutputStatus = AgentOutputStatus.NORMAL
    content: Union[str, None] = None
```

--------------------------------

### POST /api/sync_all

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Synchronizes all sessions running on task workers.

```APIDOC
## POST /api/sync_all

### Description
Syncs all sessions running on task_workers. This should be called if the controller restarts unexpectedly to ensure state consistency.

### Method
POST

### Endpoint
/api/sync_all

### Parameters
None

### Response
#### Success Response (200)
- **status** (string) - Confirmation of synchronization.
```

--------------------------------

### Session API

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The Session interface allows tasks to interact with the Agent, enabling history injection and action triggering.

```APIDOC
## Session API

### Description
Interface for communicating with the Agent during sample execution.

### Methods
- **inject(item)**: Adds one or more ChatHistoryItem objects to the agent's history.
- **action(*injection)**: Sends a prompt to the agent and waits for an AgentOutput response. Supports optional history injection during the call.

### AgentOutput Structure
- **status**: AgentOutputStatus (NORMAL, CANCELLED, AGENT_CONTEXT_LIMIT)
- **content**: The string response from the agent.
```

--------------------------------

### Knowledge Graph Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the Knowledge Graph task for querying large-scale knowledge bases via SPARQL.

```APIDOC
## POST /tasks/knowledgegraph

### Description
Configures the Knowledge Graph task environment, including the SPARQL endpoint URL.

### Method
POST

### Endpoint
/tasks/knowledgegraph

### Parameters
#### Request Body
- **env_options** (object) - Required - Contains the 'urls' mapping for the SPARQL endpoint
- **concurrency** (integer) - Required - Number of concurrent queries

### Request Example
{
  "env_options": {"urls": {"kg": "http://localhost:3001/sparql"}},
  "concurrency": 32
}
```

--------------------------------

### Configure Knowledge Graph Task

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the Knowledge Graph task to interface with a SPARQL endpoint for reasoning over large-scale data.

```yaml
default:
  module: "src.server.tasks.knowledgegraph.KnowledgeGraph"
  parameters:
    concurrency: 32
    max_rounds: 15
    one_shot: false
    env_driver: manual
    env_options:
      urls:
        kg: http://localhost:3001/sparql
```
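
As an illustration of what the `kg` URL is used for, the sketch below builds (but does not send) a query request in the GET form defined by the SPARQL 1.1 Protocol. The helper name `build_sparql_request` and the sample query are assumptions; the task's own client code may issue queries differently.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    # SPARQL 1.1 Protocol, query via GET: the query text goes in the `query` parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for JSON-formatted SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


req = build_sparql_request(
    "http://localhost:3001/sparql",  # value of env_options.urls.kg in the config above
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it, which requires a SPARQL server listening at that address.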

--------------------------------

### Agent Configuration Structure

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Defines the structure for agent configuration files within the 'agents' directory. It specifies the required 'module' and 'parameters' fields for each agent.

```yaml
agent_name:
  module: "agent_module_path"
  parameters:
    param1: value1
    param2: value2
```
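
One way to read the `module`/`parameters` pair is as "import this class, then call it with these keyword arguments". The sketch below shows that pattern with `importlib`; it assumes `module` is a dotted path ending in a class name, and it uses the standard-library `fractions.Fraction` as a placeholder target. AgentBench's actual loader may work differently.

```python
import importlib
from typing import Any, Dict


def instantiate(entry: Dict[str, Any]) -> Any:
    # Split "package.module.ClassName" into module path and class name.
    module_path, _, class_name = entry["module"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    # Pass the configured parameters as keyword arguments.
    return cls(**entry.get("parameters", {}))


# Placeholder entry: a stdlib class stands in for a real agent class.
entry = {
    "module": "fractions.Fraction",
    "parameters": {"numerator": 3, "denominator": 4},
}
obj = instantiate(entry)
print(obj)
```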

--------------------------------

### Define Custom Task Interface in Python

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The Task base class provides the structure for implementing new benchmarks. Users must implement get_indices, start_sample, and calculate_overall to define task behavior.

```python
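from typing import Any, Dict, List

# SampleIndex, Session, TaskSampleExecutionResult, and TaskOutput are
# AgentBench types and must be imported from the framework as well.


class Task:
    def __init__(self, name: str, concurrency: int = 1, *args, **kwargs):
        self.name = name
        self.concurrency = concurrency

    def get_indices(self) -> List[SampleIndex]:
        # Return the index of every sample in the task.
        raise NotImplementedError()

    async def start_sample(self, index: SampleIndex, session: Session) -> TaskSampleExecutionResult:
        # Run one sample, talking to the agent through `session`.
        raise NotImplementedError()

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        # Aggregate per-sample outputs into the final metrics.
        raise NotImplementedError()

    def release(self):
        # Optional cleanup hook; the default does nothing.
        pass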
```

--------------------------------

### Orchestrate Evaluations with Assigner Script

Source: https://context7.com/thudm/agentbench/llms.txt

The Assigner script orchestrates LLM evaluations by managing task assignments, concurrency, and result saving. It supports custom configurations and auto-retry for failed samples.

```bash
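# Run with the default configuration
python -m src.assigner
# Run with a custom configuration file
python -m src.assigner --config configs/assignments/default.yaml
# Retry failed samples automatically
python -m src.assigner --auto-retry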
```

--------------------------------

### Configure OpenAI Chat API Agent

Source: https://context7.com/thudm/agentbench/llms.txt

YAML configuration for an HTTPAgent to connect to OpenAI or compatible API endpoints. It specifies the URL, headers, request body parameters, prompter, and response parsing format.

```yaml
module: src.client.agents.HTTPAgent
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer sk-your-api-key-here
  body:
    temperature: 0
    max_tokens: 512
    model: gpt-3.5-turbo-0613
  prompter:
    name: role_content_dict
    args:
      agent_role: assistant
  return_format: "{response[choices][0][message][content]}"
```
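
The `return_format` value uses Python's format-string field syntax, where `[key]` indexes into the bound object. The sketch below shows how such a string pulls the message text out of a chat-completions style response. It assumes the agent applies `str.format` with the decoded JSON bound to `response`; the sample response body is made up for illustration.

```python
return_format = "{response[choices][0][message][content]}"

# Hypothetical decoded JSON body in the chat-completions shape.
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "SELECT 1;"}}
    ]
}

# Non-numeric keys are looked up as strings, numeric ones as integer indices.
content = return_format.format(response=response)
print(content)
```

For an API with a different response shape, only the path inside the braces needs to change.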

--------------------------------

### Evaluation Assigner

Source: https://context7.com/thudm/agentbench/llms.txt

Orchestrates evaluations by managing concurrent task execution, assigning samples using a maximum flow algorithm, and saving results.

````APIDOC
## Evaluation Assigner

The Assigner script orchestrates evaluations by reading configuration files, managing concurrent task execution, and saving results in real-time. It uses a maximum flow algorithm to optimally assign samples to available agent-task worker pairs.

### Run Evaluation

**Method:** `python -m src.assigner`

**Parameters:**
- `--config` (string, Optional): Path to the custom configuration file (e.g., `configs/assignments/default.yaml`).
- `--auto-retry` (flag, Optional): Enable auto-retry for failed samples.

**Example Configuration (`configs/assignments/default.yaml`):**
```yaml
import: definition.yaml

concurrency:
  task:
    dbbench-std: 5
    os-std: 5
  agent:
    gpt-3.5-turbo-0613: 5

assignments:
  - agent:
      - gpt-3.5-turbo-0613
    task:
      - dbbench-std
      - os-std

output: "outputs/{TIMESTAMP}"
```

**Examples:**
```bash
# Run evaluation with default configuration
python -m src.assigner

# Run with custom configuration file
python -m src.assigner --config configs/assignments/default.yaml

# Run with auto-retry for failed samples
python -m src.assigner --auto-retry
```
````
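
To make the maximum-flow idea concrete, the sketch below models agent and task concurrency limits as edge capacities and computes how many samples can run at once. It is an illustrative Edmonds-Karp implementation, not the Assigner's own code; the graph layout, the `plan_assignments` helper, and the pending-sample counts are assumptions.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    # Build a residual graph that also contains zero-capacity reverse edges.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def plan_assignments(agent_slots, task_slots, pending):
    # source -> agent (agent concurrency), agent -> task (pending samples),
    # task -> sink (task concurrency).
    capacity = defaultdict(dict)
    for agent, slots in agent_slots.items():
        capacity["SOURCE"][("agent", agent)] = slots
    for task, slots in task_slots.items():
        capacity[("task", task)]["SINK"] = slots
    for (agent, task), count in pending.items():
        capacity[("agent", agent)][("task", task)] = count
    total, residual = max_flow(capacity, "SOURCE", "SINK")
    # Flow on an agent -> task edge is its capacity minus the residual capacity.
    plan = {}
    for (agent, task), count in pending.items():
        used = count - residual[("agent", agent)][("task", task)]
        if used:
            plan[(agent, task)] = used
    return total, plan


# Concurrency limits from the example configuration; pending counts are made up.
total, plan = plan_assignments(
    agent_slots={"gpt-3.5-turbo-0613": 5},
    task_slots={"dbbench-std": 5, "os-std": 5},
    pending={
        ("gpt-3.5-turbo-0613", "dbbench-std"): 10,
        ("gpt-3.5-turbo-0613", "os-std"): 10,
    },
)
print(total, plan)
```

In this example the single agent's limit of 5 is the bottleneck, so at most five samples run concurrently even though the two tasks together allow ten.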

--------------------------------

### Interact with Task

Source: https://github.com/thudm/agentbench/blob/main/docs/Introduction_en.md

Facilitates interaction between an Agent and a Task. This endpoint receives the Agent's output, forwards it to the corresponding Task Worker, and returns the output from the Task Worker (the task environment's response).

````APIDOC
## POST /api/interact

### Description
Allows an Agent to interact with a Task. The Agent's output is sent to the Task Worker, and the Task Worker's response (from the task environment) is returned.

### Method
POST

### Endpoint
/api/interact

### Parameters
#### Request Body
- **session_id** (string) - Required - The ID of the current session.
- **agent_output** (string) - Required - The output from the agent.

### Request Example
```json
{
  "session_id": "sess_abc123",
  "agent_output": "The summary is: ..."
}
```

### Response
#### Success Response (200)
- **task_output** (string) - The output from the task environment.

#### Response Example
```json
{
  "task_output": "Evaluation complete. Agent performed well."
}
```
````

--------------------------------

### Task Controller API

Source: https://context7.com/thudm/agentbench/llms.txt

The Task Controller manages task workers and provides interfaces for client communication. It runs on port 5000 by default and exposes monitoring and control endpoints.

````APIDOC
## Task Controller API

The Task Controller is the central component managing all task workers and providing unified interfaces for client communication. It runs on port 5000 by default and exposes monitoring and control endpoints.

### Start Task Controller

Starts the task controller on the default port 5000 or a custom port.

**Method:** `python -m src.server.task_controller`

**Parameters:**
- `-p` (int, Optional): Custom port number to run the controller on.

**Example:**
```bash
# Start on default port 5000
python -m src.server.task_controller

# Start on custom port 3000
python -m src.server.task_controller -p 3000
```

### List Workers

Monitors registered task workers.

**Method:** `curl`

**Endpoint:** `http://localhost:5000/api/list_workers`

### List Sessions

Lists all active sessions.

**Method:** `curl`

**Endpoint:** `http://localhost:5000/api/list_sessions`

### Sync All Sessions

Synchronizes all sessions after an unexpected controller restart.

**Method:** `curl -X POST`

**Endpoint:** `http://localhost:5000/api/sync_all`

### Cancel All Sessions

Cancels all running sessions.

**Method:** `curl -X POST`

**Endpoint:** `http://localhost:5000/api/cancel_all`
````
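
The same endpoints can be addressed from Python. The sketch below only constructs the requests with the standard library and sends nothing; the `build_request` helper and the `ENDPOINTS` table are illustrative names, and the response format is not specified here, so none is assumed.

```python
from urllib.request import Request

BASE_URL = "http://localhost:5000/api"  # default controller port from the text above

# HTTP method per endpoint, mirroring the curl commands documented above.
ENDPOINTS = {
    "list_workers": "GET",
    "list_sessions": "GET",
    "sync_all": "POST",
    "cancel_all": "POST",
}


def build_request(name: str, base_url: str = BASE_URL) -> Request:
    # Look up the method and build the full URL for the named endpoint.
    return Request(f"{base_url}/{name}", method=ENDPOINTS[name])


req = build_request("cancel_all")
print(req.get_method(), req.full_url)
```

Passing a request to `urllib.request.urlopen` would send it, which requires a controller running at that address.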

--------------------------------

### POST /api/cancel_all

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Cancels all currently running sessions across all task workers.

```APIDOC
## POST /api/cancel_all

### Description
Cancels all sessions currently running on task_workers.

### Method
POST

### Endpoint
/api/cancel_all

### Parameters
None

### Response
#### Success Response (200)
- **status** (string) - Confirmation of cancellation.
```

--------------------------------

### Task Interface Definition

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The base Task class provides the interface for defining custom evaluation tasks, including sample retrieval, execution logic, and result calculation.

```APIDOC
## Task Interface

### Description
Base class for creating custom tasks in AgentBench. Users must inherit from this class and implement the required lifecycle methods.

### Methods
- **get_indices()**: Returns a list of sample indices to be processed.
- **start_sample(index, session)**: Executes logic for a single sample using the provided session proxy.
- **calculate_overall(results)**: Aggregates results from all samples into a final score dictionary.
- **release()**: Optional cleanup method executed after the worker process ends.

### Data Structures
- **TaskSampleExecutionResult**: Contains the status and result of a single sample execution.
- **TaskOutput**: Represents the full output of a task, including history and status.
- **SampleStatus**: Enum representing the outcome (e.g., COMPLETED, AGENT_CONTEXT_LIMIT, TASK_ERROR).
```
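
As an example of the aggregation step, the sketch below computes an accuracy-style score while counting failed samples against the total. The `SampleStatus` string values, the reduced `TaskOutput` stand-in, and the `correct` result key are hypothetical; real tasks define their own result fields and metrics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class SampleStatus(str, Enum):
    # Member names follow the documentation; the string values are assumptions.
    COMPLETED = "completed"
    AGENT_CONTEXT_LIMIT = "agent context limit"
    TASK_ERROR = "task error"


@dataclass
class TaskOutput:
    # Reduced stand-in: only the fields this sketch needs.
    status: SampleStatus
    result: Optional[Dict[str, Any]] = None


def calculate_overall(results: List[TaskOutput]) -> Dict[str, Any]:
    # Only completed samples can be correct; everything else counts as a miss.
    completed = [r for r in results if r.status is SampleStatus.COMPLETED]
    correct = sum(1 for r in completed if r.result and r.result.get("correct"))
    total = len(results)
    return {
        "score": correct / total if total else 0.0,
        "completed": len(completed),
        "total": total,
    }


overall = calculate_overall([
    TaskOutput(SampleStatus.COMPLETED, {"correct": True}),
    TaskOutput(SampleStatus.COMPLETED, {"correct": False}),
    TaskOutput(SampleStatus.TASK_ERROR),
    TaskOutput(SampleStatus.AGENT_CONTEXT_LIMIT),
])
print(overall)
```

Dividing by the total rather than by the completed count keeps errors and context-limit failures from inflating the score.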
