### Implement VirtualTask Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

A concrete example of a custom task implementation inheriting from the Task class, demonstrating sample iteration and session interaction.

```python
import asyncio
from typing import Any, Dict, List

from src.server.task import Session, Task
from src.typings import SampleStatus, TaskOutput, TaskSampleExecutionResult


class VirtualTask(Task):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(name="virtual-task", *args, **kwargs)

    def get_indices(self) -> List[Any]:
        # Ten samples, indexed 0..9
        return list(range(10))

    async def start_sample(self, index, session: Session):
        print("task start sample")
        # Three rounds of interaction with the agent per sample
        for loop_times in range(3):
            await asyncio.sleep(1)
            res = await session.action({"role": "user", "content": "Loop: %d" % loop_times})
            print("TASK", res.content)
        return TaskSampleExecutionResult(status=SampleStatus.COMPLETED, result={"result": "ok"})

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        # Fixed placeholder score
        return {"score": 0.4}
```

--------------------------------

### DBBench Task Configuration Example

Source: https://context7.com/thudm/agentbench/llms.txt

Example YAML assignment configuration that schedules the DBBench task alongside the OS task. DBBench evaluates LLM capabilities in performing SQL operations on MySQL databases using function calling.

```yaml
import: definition.yaml

concurrency:
  task:
    dbbench-std: 5
    os-std: 5
  agent:
    gpt-3.5-turbo-0613: 5

assignments:
  - agent:
      - gpt-3.5-turbo-0613
    task:
      - dbbench-std
      - os-std

output: "outputs/{TIMESTAMP}"
```

--------------------------------

### Start and Manage Task Controller API

Source: https://context7.com/thudm/agentbench/llms.txt

Commands to start the Task Controller on default or custom ports, and to monitor and manage task workers and sessions via its API.
```bash
python -m src.server.task_controller
python -m src.server.task_controller -p 3000
curl http://localhost:5000/api/list_workers
curl http://localhost:5000/api/list_sessions
curl -X POST http://localhost:5000/api/sync_all
curl -X POST http://localhost:5000/api/cancel_all
```

--------------------------------

### YAML Import Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Demonstrates how to use the 'import' keyword in YAML to include configurations from other files, supporting both single and multiple file imports. Nested imports are processed recursively.

```yaml
# config.yaml
definition:
  def1: something...
  def2: something...
```

```yaml
# def1.yaml
def1: something...

# def2.yaml
def2: something...

# config.yaml
definition:
  import:
    - def1.yaml
    - def2.yaml
```

--------------------------------

### Launch AgentBench Services with Docker Compose

Source: https://github.com/thudm/agentbench/blob/main/README.md

Command to start the AgentBench infrastructure, including the controller, task workers, and supporting services like Redis and Freebase.

```shell
docker compose -f extra/docker-compose.yml up
```

--------------------------------

### Start Task Configuration

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Describes the 'start_task.yaml' file used with 'src.start_task' for automating task worker launches. It includes fields for task definitions, starting specific tasks, and controller address.

```yaml
definition:
  import: "task_assembly.yaml"
start:
  task_name1: 5
  task_name2: 3
controller_address: "http://localhost:5000/api/"
```

--------------------------------

### Setup Docker Images for AgentBench Tasks

Source: https://github.com/thudm/agentbench/blob/main/README.md

Commands to pull or build the necessary Docker images for dbbench and os_interaction tasks. These images are required before launching the full AgentBench stack.
```shell
# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
```

--------------------------------

### Automate Task Worker Launch with start_task

Source: https://context7.com/thudm/agentbench/llms.txt

Scripts to automate the bulk launching of task workers. Supports auto-controller launch, lite configurations for limited RAM, manual task starting, and custom base ports for workers.

```bash
python -m src.start_task -a
python -m src.start_task -a --config configs/start_task_lite.yaml
python -m src.start_task -s dbbench-std 5 os-std 3
python -m src.start_task -a --base-port 6001
```

--------------------------------

### Task Worker Start Script

Source: https://context7.com/thudm/agentbench/llms.txt

Automates the bulk launching of task workers based on configuration, connecting them to the controller.

```APIDOC
## Task Worker Start Script

The `start_task` module automates bulk launching of task workers based on configuration. It reads from the configuration file and connects workers to the controller automatically.

### Start Task Workers

**Method:** `python -m src.start_task`

**Parameters:**
- `-a` (flag, Optional): Automatically start the controller and workers.
- `--config` (string, Optional): Path to the configuration file (e.g., `configs/start_task_lite.yaml`).
- `-s` (string, Optional): Manually specify tasks and worker counts (e.g., `dbbench-std 5 os-std 3`).
- `--base-port` (int, Optional): Custom base port for workers (workers use ports starting from this value).
**Examples:**
```bash
# Start task workers with auto-controller (launches controller + workers)
python -m src.start_task -a

# Start with lite preset for limited RAM environments
python -m src.start_task -a --config configs/start_task_lite.yaml

# Start specific tasks manually with custom worker counts
python -m src.start_task -s dbbench-std 5 os-std 3

# Start with custom base port for workers (workers use ports starting from 6001)
python -m src.start_task -a --base-port 6001
```
```

--------------------------------

### YAML Default Keyword Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Illustrates the use of the 'default' keyword in YAML to specify default values for configuration parameters. Default values are merged with the specific values of each sibling entry, and the specific values win on conflict.

```yaml
definition:
  def1:
    type: int
    value: 1
  def2:
    type: int
    value: 2
  def3:
    type: float
    value: 1.1
```

```yaml
definition:
  default:
    type: int
  def1:
    value: 1
  def2:
    value: 2
  def3:
    type: float
    value: 1.1
```

--------------------------------

### Start New Test Case

Source: https://github.com/thudm/agentbench/blob/main/docs/Introduction_en.md

Initiates a new test case on the Task Server. This endpoint assigns the task to an available Task Worker and returns a unique session ID for tracking. It also provides the task description or initial prompt.

```APIDOC
## POST /api/start_sample

### Description
Initiates a new test case, assigning it to a Task Worker and returning a `session_id` for future reference. The response includes the task description or initial prompt.

### Method
POST

### Endpoint
/api/start_sample

### Parameters
#### Request Body
- **agent** (string) - Required - The agent to be used for the task.
- **task_id** (string) - Required - The identifier of the task to be started.
- **task_config** (object) - Optional - Configuration for the task.
### Request Example
```json
{
  "agent": "some_agent",
  "task_id": "task_123",
  "task_config": {
    "setting": "value"
  }
}
```

### Response
#### Success Response (200)
- **session_id** (string) - A unique identifier for the test case session.
- **task_description** (string) - The description of the task.
- **prompt** (string) - The initial prompt for the agent.

#### Response Example
```json
{
  "session_id": "sess_abc123",
  "task_description": "Evaluate the agent's ability to summarize text.",
  "prompt": "Please summarize the following document..."
}
```
```

--------------------------------

### YAML Overwrite Keyword Example

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Explains the 'overwrite' keyword in YAML, which functions similarly to 'default' but gives the 'overwrite' values higher priority in case of conflicts. This is useful for setting mandatory values.

```yaml
agent:
  module: "some.agent.module"
  parameters:
    overwrite:
      api_key: "your_api_key"
    default:
      timeout: 60
```

--------------------------------

### GET /api/list_sessions

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Retrieves a list of all active task sessions.

```APIDOC
## GET /api/list_sessions

### Description
Returns all active sessions managed by the task controller.

### Method
GET

### Endpoint
/api/list_sessions

### Parameters
None

### Response
#### Success Response (200)
- **sessions** (array) - List of active session objects.
```

--------------------------------

### GET /api/list_workers

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Retrieves a list of all currently registered task workers.

```APIDOC
## GET /api/list_workers

### Description
Returns a list of all task_workers currently registered with the controller.

### Method
GET

### Endpoint
/api/list_workers

### Parameters
None

### Response
#### Success Response (200)
- **workers** (array) - List of active worker objects.
```

--------------------------------

### Bash Script to Count Files Recursively

Source: https://github.com/thudm/agentbench/blob/main/data/os_interaction/scripts/5/prompt.md

This bash script recursively counts the number of regular files within a specified directory and its subdirectories. It handles regular files, directories, and symbolic links. The script is installed to /usr/local/bin and is executable. The checking script verifies its correctness against various directories.

```bash
#!/bin/bash

count_files() {
    local dir=$1
    local count=0

    for file in "$dir"/*; do
        if [ -f "$file" ]; then
            count=$((count + 1))
        elif [ -d "$file" ]; then
            count_sub=$(count_files "$file")
            count=$((count + count_sub))
        fi
    done

    echo "$count"
}

directory="$1"
total_count=$(count_files "$directory")
echo "$total_count"
```

```bash
#!/bin/bash

count_files() {
    # echo $1 >> tmp.log
    local dir=$1
    local count=0

    for file in "$dir"/*; do
        if [ -f "$file" ]; then
            count=$((count + 1))
        elif [ -d "$file" ]; then
            count_sub=$(count_files "$file")
            count=$((count + count_sub))
        fi
    done

    echo "$count"
}

# echo `count_files "/usr/local/bin"`, `count "/usr/local/bin"`

[ `count_files "/usr/local/bin"`x != `count "/usr/local/bin"`x ] && exit 1
[ `count_files "/root"`x != `count "/root"`x ] && exit 1
[ `count_files "/bin"`x != `count "/bin"`x ] && exit 1
[ `count_files "/lib"`x != `count "/lib"`x ] && exit 1
[ `count_files "/dev"`x != `count "/dev"`x ] && exit 1
[ `count_files "/usr/include"`x != `count "/usr/include"`x ] && exit 1

exit 0
```

--------------------------------

### Configure Modular YAML Settings

Source: https://context7.com/thudm/agentbench/llms.txt

Demonstrates the use of 'import', 'default', and 'overwrite' keywords to manage complex task configurations and inheritance in AgentBench.
```yaml
definition:
  import:
    - tasks/task_assembly.yaml
    - agents/api_agents.yaml
---
tasks:
  default:
    module: src.server.tasks.BaseTask
    parameters:
      concurrency: 32
  task1:
    parameters:
      name: "task1"
  task2:
    parameters:
      name: "task2"
      concurrency: 16
---
definition:
  task:
    overwrite:
      module: src.client.TaskClient
      parameters:
        controller_address: "http://localhost:5000/api"
    import: ../tasks/task_assembly.yaml
```

--------------------------------

### Configure ALFWorld Task Environment

Source: https://context7.com/thudm/agentbench/llms.txt

Sets up the ALFWorld house-holding environment, specifying data paths, prompt configurations, and agent action tools.

```yaml
default:
  module: src.server.tasks.alfworld.ALFWorld
  parameters:
    name: alfworld-std
    concurrency: 16
    data_path: "/app/data/alfworld"
    config_path: "/app/src/server/tasks/alfworld/configs/base_config.yaml"
    prompts_path: "/app/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
    split: "new_std"
    max_step: 20
    tools:
      - type: "function"
        function:
          name: "take_action"
          description: "Take an action."
          parameters:
            type: "object"
            properties:
              action:
                type: "string"
            required:
              - "action"
```

--------------------------------

### Manage Docker Infrastructure for AgentBench

Source: https://context7.com/thudm/agentbench/llms.txt

Shell commands to pull required images, build task-specific containers, and orchestrate the full stack deployment with scaling options.
```bash
# Build Docker images required for tasks
docker pull mysql:8
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles

# Start full stack with Docker Compose
docker compose -f extra/docker-compose.yml up

# Scale specific task workers
docker compose -f extra/docker-compose.yml up --scale alfworld-std=3
```

--------------------------------

### Database Benchmark Task (DBBench)

Source: https://context7.com/thudm/agentbench/llms.txt

Evaluates LLM capability in performing SQL operations on MySQL databases using a function-calling style prompt system.

```APIDOC
## Database Benchmark Task (DBBench)

The DBBench task evaluates LLM capability in performing SQL operations on MySQL databases. Agents use function calling to execute queries and commit final answers.

### Configuration Example (`configs/tasks/dbbench.yaml`)

```yaml
# Example configuration for DBBench task
# This is a placeholder and actual configuration may vary.
module: src.client.tasks.DBBench
parameters:
  db_name: "agentbench_db"
  db_user: "user"
  db_password: "password"
  db_host: "localhost"
  init_sql: "scripts/dbbench/init.sql"
  # ... other DBBench specific parameters
```
```

--------------------------------

### WebShop Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the WebShop environment for agent evaluation, including search and click action tools.

```APIDOC
## POST /tasks/webshop

### Description
Configures the WebShop task module with concurrency settings and available agent tools.
### Method
POST

### Endpoint
/tasks/webshop

### Parameters
#### Request Body
- **concurrency** (integer) - Required - Number of concurrent tasks
- **round** (integer) - Required - Number of evaluation rounds
- **tools** (array) - Required - List of available function tools (search_action, click_action)

### Request Example
{
  "concurrency": 64,
  "round": 20,
  "tools": [{"name": "search_action"}, {"name": "click_action"}]
}
```

--------------------------------

### Configure WebShop Task Environment

Source: https://context7.com/thudm/agentbench/llms.txt

Defines the WebShop task module, concurrency settings, and available function tools for the agent in YAML format.

```yaml
default:
  module: src.server.tasks.webshop.WebShop
  parameters:
    concurrency: 64
    round: 20
    tools:
      - type: "function"
        function:
          name: "search_action"
          description: "Use search functionality with specified keywords."
          parameters:
            type: "object"
            properties:
              keywords:
                type: "string"
            required:
              - "keywords"
      - type: "function"
        function:
          name: "click_action"
          description: "Click a button or link with a specified value."
          parameters:
            type: "object"
            properties:
              value:
                type: "string"
            required:
              - "value"
```

--------------------------------

### Implement Custom Task Class

Source: https://context7.com/thudm/agentbench/llms.txt

Demonstrates how to inherit from the Task base class to create a custom evaluation task, including sample indexing, execution logic, and metric calculation.
```python
from typing import List, Dict, Any
from src.typings import (
    AgentOutputStatus,
    SampleIndex,
    SampleStatus,
    TaskOutput,
    TaskSampleExecutionResult,
)
from src.server.task import Task, Session
import asyncio


class VirtualTask(Task):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(name="virtual-task", *args, **kwargs)
        self.data = [
            {"question": "What is 2+2?", "answer": "4"},
            {"question": "Capital of France?", "answer": "Paris"},
        ]

    def get_indices(self) -> List[SampleIndex]:
        return list(range(len(self.data)))

    async def start_sample(self, index: SampleIndex, session: Session) -> TaskSampleExecutionResult:
        sample = self.data[index]
        response = await session.action(
            {"role": "user", "content": f"Question: {sample['question']}"}
        )

        # Simplification: any non-normal status is reported as a context-limit failure.
        if response.status != AgentOutputStatus.NORMAL:
            return TaskSampleExecutionResult(
                status=SampleStatus.AGENT_CONTEXT_LIMIT,
                result={"error": "Agent context limit reached"}
            )

        # content may be None, so fall back to an empty string before matching.
        is_correct = sample["answer"].lower() in (response.content or "").lower()
        return TaskSampleExecutionResult(
            status=SampleStatus.COMPLETED,
            result={"correct": is_correct, "response": response.content}
        )

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        correct_count = sum(1 for r in results if r.result and r.result.get("correct"))
        return {
            "accuracy": correct_count / len(results) if results else 0,
            "total": len(results),
            "correct": correct_count
        }
```

--------------------------------

### ALFWorld Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the ALFWorld house-holding environment for text-based agent reasoning tasks.

```APIDOC
## POST /tasks/alfworld

### Description
Sets up the ALFWorld environment with specific data paths and prompt configurations.
### Method
POST

### Endpoint
/tasks/alfworld

### Parameters
#### Request Body
- **data_path** (string) - Required - Path to ALFWorld dataset
- **config_path** (string) - Required - Path to base configuration YAML
- **max_step** (integer) - Required - Maximum steps allowed per task

### Request Example
{
  "data_path": "/app/data/alfworld",
  "max_step": 20
}
```

--------------------------------

### Build Docker Images for OS Interaction (Bash)

Source: https://context7.com/thudm/agentbench/llms.txt

These bash commands build the necessary Docker images for the OS Interaction task, including the base Ubuntu image and custom images for different configurations like 'default', 'packages', and 'ubuntu'.

```bash
# Build required Docker images for OS Interaction task
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu
```

--------------------------------

### Define AgentBench Services with Docker Compose

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the controller and task-specific workers like alfworld and dbbench. Uses host networking and volume mounts to facilitate communication between the controller and task environments.

```yaml
name: agentbench-fc

services:
  controller:
    image: jingbh/agentrl-controller:latest
    container_name: agentrl-controller
    network_mode: host
    command:
      - controller

  alfworld-std:
    build:
      context: ..
      dockerfile: src/server/tasks/alfworld/Dockerfile
    command: --controller http://172.17.0.1:5020/api alfworld-std
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - controller

  dbbench-std:
    build:
      context: ..
      dockerfile: src/server/tasks/dbbench/Dockerfile
    command: --controller http://172.17.0.1:5020/api dbbench-std
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DBBENCH_STD_PARAMETERS_ENV_OPTIONS_NETWORK_NAME=agentbench-fc_default
    depends_on:
      - controller

  redis:
    image: redis:7
    container_name: redis
    network_mode: host
```

--------------------------------

### Build Docker Image for DBBench (Bash)

Source: https://context7.com/thudm/agentbench/llms.txt

This bash command pulls the MySQL 8 Docker image, which is required for the DBBench task to set up its database environment.

```bash
# Build required Docker image for DBBench
docker pull mysql:8
```

--------------------------------

### DBBench Task Configuration (YAML)

Source: https://context7.com/thudm/agentbench/llms.txt

Configuration for the DBBench task, which involves executing SQL queries against a database. It defines parameters for concurrency, maximum rounds, and available tools like 'execute_sql' and 'commit_final_answer'. The environment driver is set to Docker with specific network and state options.

```yaml
default:
  module: src.server.tasks.dbbench.DBBenchTask
  parameters:
    concurrency: 32
    max_round: 15
    tools:
      - type: "function"
        function:
          name: "execute_sql"
          description: "Executes a given SQL statement on the database and returns the result."
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "The SQL query to be executed."
            required:
              - "query"
            additionalProperties: false
      - type: "function"
        function:
          name: "commit_final_answer"
          description: "Commits the final answer after all operations are completed."
          parameters:
            type: "object"
            properties:
              answers:
                type: "array"
                items:
                  type: "string"
                description: "The list of final answers to commit."
            required:
              - "answers"
            additionalProperties: false
    env_driver: docker
    env_options:
      network_name: dbbench_default
    state_driver: redis
    state_options:
      connection:
        host: 172.17.0.1

dbbench-std:
  parameters:
    name: dbbench-std
    data_file: "data/dbbench/standard.jsonl"
```

--------------------------------

### Test Agent Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Command to test agent configurations, specifically for API-based agents like GPT-3.5-turbo-0613, using a provided configuration file.

```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
```

--------------------------------

### Handle Agent Sessions and Interactions

Source: https://context7.com/thudm/agentbench/llms.txt

Python implementation for managing agent conversation history and processing responses using the Session interface. Includes status checking for context limits and cancellation.

```python
from src.server.task import Session
from src.typings import AgentOutput, AgentOutputStatus, ChatHistoryItem


async def handle_agent_interaction(session: Session):
    # Inject a single history item, then a list of items
    session.inject({"role": "user", "content": "You are a helpful assistant."})
    session.inject([
        {"role": "user", "content": "Hello"},
        {"role": "agent", "content": "Hi! How can I help?"}
    ])

    # Send a prompt and wait for the agent's reply
    response: AgentOutput = await session.action(
        {"role": "user", "content": "What is the weather today?"}
    )

    if response.status == AgentOutputStatus.NORMAL:
        print(f"Agent response: {response.content}")
    elif response.status == AgentOutputStatus.AGENT_CONTEXT_LIMIT:
        print("Agent reached context limit")
    elif response.status == AgentOutputStatus.CANCELLED:
        print("Request was cancelled")
        return None

    return response.content
```

--------------------------------

### OS Interaction Task Configuration (YAML)

Source: https://context7.com/thudm/agentbench/llms.txt

Configuration for the OS Interaction task, designed to test LLMs in a Linux environment.
It specifies parameters like concurrency and round limit, and defines tools for executing bash scripts ('bash_action'), indicating task completion ('finish_action'), and providing answers ('answer_action').

```yaml
# configs/tasks/os.yaml
default:
  module: "src.server.tasks.os_interaction.OSInteraction"
  parameters:
    concurrency: 32
    round_limit: 8
    tools:
      - type: "function"
        function:
          name: "bash_action"
          description: "Execute bash code to perform an operation in the Linux environment."
          parameters:
            type: "object"
            properties:
              script:
                type: "string"
                description: "The bash script to be executed."
            required:
              - "script"
            additionalProperties: false
      - type: "function"
        function:
          name: "finish_action"
          description: "Indicate that the task has been finished or need some additional information."
          parameters:
            type: "object"
            properties:
              thought:
                type: "string"
                description: "The thought or reason indicating the task is finished."
            required:
              - "thought"
            additionalProperties: false
      - type: "function"
        function:
          name: "answer_action"
          description: "Provide the answer to the question."
          parameters:
            type: "object"
            properties:
              answer:
                type: "string"
                description: "The answer to the question."
            required:
              - "answer"
            additionalProperties: false
    docker_config:
      localhost: local-os
      directory: data/os_interaction/res/dockerfiles
    env_driver: docker
    env_options:
      network_name: os_interaction_default
    state_driver: redis
    state_options:
      connection:
        host: 172.17.0.1
```

--------------------------------

### Task Configuration Structure

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Outlines the structure for task configuration files within the 'tasks' directory. It requires 'module' and 'parameters' and is typically used for defining reusable task components.
```yaml
task_name:
  module: "task_module_path"
  parameters:
    param1: value1
    param2: value2
```

--------------------------------

### Agent Configuration (OpenAI Chat API)

Source: https://context7.com/thudm/agentbench/llms.txt

Configure HTTP-based agents to connect to OpenAI or compatible API endpoints, supporting customizable request formatting, authentication, and response parsing.

```APIDOC
## Agent Configuration (OpenAI Chat API)

Configure HTTP-based agents to connect to OpenAI or compatible API endpoints. The HTTPAgent class supports customizable request formatting, authentication headers, and response parsing.

### Example Configuration (`configs/agents/openai-chat.yaml`)

```yaml
module: src.client.agents.HTTPAgent
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer sk-your-api-key-here
  body:
    temperature: 0
    max_tokens: 512
    model: gpt-3.5-turbo-0613
  prompter:
    name: role_content_dict
    args:
      agent_role: assistant
  return_format: "{response[choices][0][message][content]}"
```

### Test Agent Configuration

**Method:** `python -m src.client.agent_test`

**Parameters:**
- `--config` (string): Path to the agent configuration file (e.g., `configs/agents/api_agents.yaml`).
- `--agent` (string): The name of the agent to test (e.g., `gpt-3.5-turbo-0613`).

**Example:**
```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
```
```

--------------------------------

### AgentBench Data Structures

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

Core Pydantic models and Enums used for tracking task execution status, chat history, and agent outputs.
```python
class TaskSampleExecutionResult(BaseModel):
    status: SampleStatus = SampleStatus.COMPLETED
    result: JSONSerializable = None


class TaskOutput(BaseModel):
    index: Union[None, SampleIndex] = None
    status: SampleStatus = SampleStatus.RUNNING
    result: JSONSerializable = None
    history: Union[None, List[ChatHistoryItem]] = None


class AgentOutput(BaseModel):
    status: AgentOutputStatus = AgentOutputStatus.NORMAL
    content: Union[str, None] = None
```

--------------------------------

### POST /api/sync_all

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Synchronizes all sessions running on task workers.

```APIDOC
## POST /api/sync_all

### Description
Syncs all sessions running on task_workers. This should be called if the controller restarts unexpectedly to ensure state consistency.

### Method
POST

### Endpoint
/api/sync_all

### Parameters
None

### Response
#### Success Response (200)
- **status** (string) - Confirmation of synchronization.
```

--------------------------------

### Session API

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The Session interface allows tasks to interact with the Agent, enabling history injection and action triggering.

```APIDOC
## Session API

### Description
Interface for communicating with the Agent during sample execution.

### Methods
- **inject(item)**: Adds one or more ChatHistoryItem objects to the agent's history.
- **action(*injection)**: Sends a prompt to the agent and waits for an AgentOutput response. Supports optional history injection during the call.

### AgentOutput Structure
- **status**: AgentOutputStatus (NORMAL, CANCELLED, AGENT_CONTEXT_LIMIT)
- **content**: The string response from the agent.
```

--------------------------------

### Knowledge Graph Task Configuration

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the Knowledge Graph task for querying large-scale knowledge bases via SPARQL.
```APIDOC
## POST /tasks/knowledgegraph

### Description
Configures the Knowledge Graph task environment, including the SPARQL endpoint URL.

### Method
POST

### Endpoint
/tasks/knowledgegraph

### Parameters
#### Request Body
- **env_options** (object) - Required - Contains the 'urls' mapping for the SPARQL endpoint
- **concurrency** (integer) - Required - Number of concurrent queries

### Request Example
{
  "env_options": {"urls": {"kg": "http://localhost:3001/sparql"}},
  "concurrency": 32
}
```

--------------------------------

### Configure Knowledge Graph Task

Source: https://context7.com/thudm/agentbench/llms.txt

Configures the Knowledge Graph task to interface with a SPARQL endpoint for reasoning over large-scale data.

```yaml
default:
  module: "src.server.tasks.knowledgegraph.KnowledgeGraph"
  parameters:
    concurrency: 32
    max_rounds: 15
    one_shot: false
    env_driver: manual
    env_options:
      urls:
        kg: http://localhost:3001/sparql
```

--------------------------------

### Agent Configuration Structure

Source: https://github.com/thudm/agentbench/blob/main/docs/Config_en.md

Defines the structure for agent configuration files within the 'agents' directory. It specifies the required 'module' and 'parameters' fields for each agent.

```yaml
agent_name:
  module: "agent_module_path"
  parameters:
    param1: value1
    param2: value2
```

--------------------------------

### Define Custom Task Interface in Python

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The Task base class provides the structure for implementing new benchmarks. Users must implement get_indices, start_sample, and calculate_overall to define task behavior.
```python
class Task:
    def __init__(self, name: str, concurrency: int = 1, *args, **kwargs):
        self.name = name
        self.concurrency = concurrency

    def get_indices(self) -> List[SampleIndex]:
        raise NotImplementedError()

    async def start_sample(self, index: SampleIndex, session: Session) -> TaskSampleExecutionResult:
        raise NotImplementedError()

    def calculate_overall(self, results: List[TaskOutput]) -> Dict[str, Any]:
        raise NotImplementedError()

    def release(self):
        pass
```

--------------------------------

### Orchestrate Evaluations with Assigner Script

Source: https://context7.com/thudm/agentbench/llms.txt

The Assigner script orchestrates LLM evaluations by managing task assignments, concurrency, and result saving. It supports custom configurations and auto-retry for failed samples.

```bash
python -m src.assigner
python -m src.assigner --config configs/assignments/default.yaml
python -m src.assigner --auto-retry
```

--------------------------------

### Configure OpenAI Chat API Agent

Source: https://context7.com/thudm/agentbench/llms.txt

YAML configuration for an HTTPAgent to connect to OpenAI or compatible API endpoints. It specifies the URL, headers, request body parameters, prompter, and response parsing format.

```yaml
module: src.client.agents.HTTPAgent
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer sk-your-api-key-here
  body:
    temperature: 0
    max_tokens: 512
    model: gpt-3.5-turbo-0613
  prompter:
    name: role_content_dict
    args:
      agent_role: assistant
  return_format: "{response[choices][0][message][content]}"
```

--------------------------------

### Evaluation Assigner

Source: https://context7.com/thudm/agentbench/llms.txt

Orchestrates evaluations by managing concurrent task execution, assigning samples using a maximum flow algorithm, and saving results.
```APIDOC
## Evaluation Assigner

The Assigner script orchestrates evaluations by reading configuration files, managing concurrent task execution, and saving results in real-time. It uses a maximum flow algorithm to optimally assign samples to available agent-task worker pairs.

### Run Evaluation

**Method:** `python -m src.assigner`

**Parameters:**
- `--config` (string, Optional): Path to the custom configuration file (e.g., `configs/assignments/default.yaml`).
- `--auto-retry` (flag, Optional): Enable auto-retry for failed samples.

**Example Configuration (`configs/assignments/default.yaml`):**
```yaml
import: definition.yaml

concurrency:
  task:
    dbbench-std: 5
    os-std: 5
  agent:
    gpt-3.5-turbo-0613: 5

assignments:
  - agent:
      - gpt-3.5-turbo-0613
    task:
      - dbbench-std
      - os-std

output: "outputs/{TIMESTAMP}"
```

**Examples:**
```bash
# Run evaluation with default configuration
python -m src.assigner

# Run with custom configuration file
python -m src.assigner --config configs/assignments/default.yaml

# Run with auto-retry for failed samples
python -m src.assigner --auto-retry
```
```

--------------------------------

### Interact with Task

Source: https://github.com/thudm/agentbench/blob/main/docs/Introduction_en.md

Facilitates interaction between an Agent and a Task. This endpoint receives the Agent's output, forwards it to the corresponding Task Worker, and returns the output from the Task Worker (the task environment's response).

```APIDOC
## POST /api/interact

### Description
Allows an Agent to interact with a Task. The Agent's output is sent to the Task Worker, and the Task Worker's response (from the task environment) is returned.

### Method
POST

### Endpoint
/api/interact

### Parameters
#### Request Body
- **session_id** (string) - Required - The ID of the current session.
- **agent_output** (string) - Required - The output from the agent.

### Request Example
```json
{
  "session_id": "sess_abc123",
  "agent_output": "The summary is: ..."
}
```

### Response

#### Success Response (200)
- **task_output** (string) - The output from the task environment.

#### Response Example
```json
{
  "task_output": "Evaluation complete. Agent performed well."
}
```
```

--------------------------------

### Task Controller API

Source: https://context7.com/thudm/agentbench/llms.txt

The Task Controller manages task workers and provides interfaces for client communication. It runs on port 5000 by default and exposes monitoring and control endpoints.

```APIDOC
## Task Controller API

The Task Controller is the central component managing all task workers and providing unified interfaces for client communication. It runs on port 5000 by default and exposes monitoring and control endpoints.

### Start Task Controller

Starts the task controller on the default port 5000 or a custom port.

**Method:** `python -m src.server.task_controller`

**Parameters:**
- `-p` (int, Optional): Custom port number to run the controller on.

**Example:**
```bash
# Start on default port 5000
python -m src.server.task_controller

# Start on custom port 3000
python -m src.server.task_controller -p 3000
```

### List Workers

Monitors registered task workers.

**Method:** `curl`

**Endpoint:** `http://localhost:5000/api/list_workers`

### List Sessions

Lists all active sessions.

**Method:** `curl`

**Endpoint:** `http://localhost:5000/api/list_sessions`

### Sync All Sessions

Synchronizes all sessions after an unexpected controller restart.

**Method:** `curl -X POST`

**Endpoint:** `http://localhost:5000/api/sync_all`

### Cancel All Sessions

Cancels all running sessions.

**Method:** `curl -X POST`

**Endpoint:** `http://localhost:5000/api/cancel_all`
```

--------------------------------

### POST /api/cancel_all

Source: https://github.com/thudm/agentbench/blob/main/docs/Entrance_en.md

Cancels all currently running sessions across all task workers.

```APIDOC
## POST /api/cancel_all

### Description
Cancels all sessions currently running on task_workers.

### Method
POST

### Endpoint
/api/cancel_all

### Parameters
None

### Response

#### Success Response (200)
- **status** (string) - Confirmation of cancellation.
```

--------------------------------

### Task Interface Definition

Source: https://github.com/thudm/agentbench/blob/main/docs/Extension_en.md

The base Task class provides the interface for defining custom evaluation tasks, including sample retrieval, execution logic, and result calculation.

```APIDOC
## Task Interface

### Description
Base class for creating custom tasks in AgentBench. Users must inherit from this class and implement the required lifecycle methods.

### Methods
- **get_indices()**: Returns a list of sample indices to be processed.
- **start_sample(index, session)**: Executes logic for a single sample using the provided session proxy.
- **calculate_overall(results)**: Aggregates results from all samples into a final score dictionary.
- **release()**: Optional cleanup method executed after the worker process ends.

### Data Structures
- **TaskSampleExecutionResult**: Contains the status and result of a single sample execution.
- **TaskOutput**: Represents the full output of a task, including history and status.
- **SampleStatus**: Enum representing the outcome (e.g., COMPLETED, AGENT_CONTEXT_LIMIT, TASK_ERROR).
```
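
--------------------------------

### Task Lifecycle Sketch (Illustrative)

The four lifecycle methods of the Task interface can be exercised end to end with a small, self-contained sketch. This is not AgentBench code. `SampleStatus` and `TaskSampleExecutionResult` here are simplified stand-ins that only borrow the names, while `EchoSession`, `ToyTask`, and `run_task` are hypothetical helpers invented for illustration. As a further simplification, `calculate_overall` receives per-sample results directly instead of `TaskOutput` objects.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


# Simplified stand-ins for AgentBench's types so the sketch runs on its own.
class SampleStatus(str, Enum):
    COMPLETED = "completed"
    TASK_ERROR = "task error"


@dataclass
class TaskSampleExecutionResult:
    status: SampleStatus
    result: Dict[str, Any] = field(default_factory=dict)


class EchoSession:
    """Hypothetical session stub: replies with the upper-cased prompt."""

    async def action(self, message: Dict[str, str]) -> str:
        return message["content"].upper()


class ToyTask:
    """Hypothetical task that follows the four lifecycle methods."""

    def __init__(self, name: str = "toy-task") -> None:
        self.name = name
        self.released = False

    def get_indices(self) -> List[int]:
        return [0, 1, 2]

    async def start_sample(self, index: int, session: EchoSession) -> TaskSampleExecutionResult:
        reply = await session.action({"role": "user", "content": f"sample {index}"})
        ok = reply == f"SAMPLE {index}"
        status = SampleStatus.COMPLETED if ok else SampleStatus.TASK_ERROR
        return TaskSampleExecutionResult(status=status, result={"reply": reply})

    def calculate_overall(self, results: List[TaskSampleExecutionResult]) -> Dict[str, Any]:
        # Score is the fraction of samples that completed.
        completed = sum(r.status is SampleStatus.COMPLETED for r in results)
        return {"score": completed / len(results) if results else 0.0}

    def release(self) -> None:
        self.released = True


async def run_task(task: ToyTask) -> Dict[str, Any]:
    """Drive the lifecycle: indices, samples, overall score, then release."""
    try:
        results = [await task.start_sample(i, EchoSession()) for i in task.get_indices()]
        return task.calculate_overall(results)
    finally:
        task.release()


overall = asyncio.run(run_task(ToyTask()))
print(overall)  # {'score': 1.0}
```

In a real deployment the task worker drives these calls and supplies the session, so a driver like `run_task` would presumably not be needed. It exists here only to show the order in which the methods are expected to run.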