AgentBench is a multi-dimensional benchmark that tests how well Large Language Models (LLMs) can act as intelligent agents across 8 interactive environments. It aims to evaluate an LLM's reasoning and decision-making capabilities over multi-turn interactions.
The new benchmark is open source and was proposed by a research team from Tsinghua University, the Ohio State University, and UC Berkeley.
It covers a wide range of domains and tasks, such as operating systems, databases, knowledge graphs, games, puzzles, and web interactions. AgentBench also provides evaluation methods and metrics, such as automatic and human evaluation, to measure the performance of different LLMs as agents.
The paper presents the results of testing 25 LLMs (both API-based commercial models and open-source models) on AgentBench, showing a significant gap between commercial and open-source LLMs as agents.
LLM-as-Agent
An LLM-as-Agent is a type of artificial intelligence (AI) agent that uses an LLM as its "brain". It has a goal and performs a series of actions to reach it: it can interpret prompts and generate responses, plan actions, reason about its environment, and learn from feedback. To achieve its objectives, it combines the LLM with other tools, such as databases, APIs, or web services.
While an LLM-as-Agent has a purpose and can act on its own to achieve it, an LLM by itself is just a language model that generates natural language and code learned from a large corpus of text data.
Some examples of LLM-as-Agent systems are AutoGPT and BabyAGI; AgentBench itself is a benchmark for evaluating such agents rather than an agent.
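To make the idea concrete, here is a minimal sketch of an agent loop in Python. It assumes `llm` is any callable that returns the model's next message; the `tool:`/`final:` text protocol and the helper names are invented purely for illustration and are not the API of AgentBench, AutoGPT, or any other framework.

```python
from typing import Callable, Dict

def agent_loop(llm: Callable[[str], str],
               goal: str,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 10) -> str:
    """Drive an LLM toward a goal by letting it call tools and observe results.
    The `tool:` / `final:` protocol is a made-up convention for this sketch."""
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = llm(history + "Next step? Reply `tool: <name> <input>` or `final: <answer>`.\n")
        history += reply + "\n"
        if reply.startswith("final:"):
            return reply[len("final:"):].strip()          # the agent believes it is done
        if reply.startswith("tool:"):
            name, _, arg = reply[len("tool:"):].strip().partition(" ")
            observation = tools.get(name, lambda _: "unknown tool")(arg)
            history += f"Observation: {observation}\n"    # feedback the agent can act on
    return "Step limit reached without a final answer."
```

The loop captures the essential difference from a bare LLM: the model's outputs are parsed into actions, executed against external tools, and the observations are fed back into the next prompt.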
Composition of AgentBench
AgentBench consists of 8 distinct environments that cover different domains, scenarios, and challenges for LLMs as agents.
- 3 existing environments from published datasets: house-holding (ALFWorld), web shopping (WebShop), and web browsing (Mind2Web).
- 5 new environments created by the research team, namely operating system (OS), database (DB), knowledge graph (KG), digital card game (DCG), and lateral thinking puzzles (LTP).
All these environments are designed to simulate real-world problems. For example, LLMs were tested in a real operating system's interactive bash environment (an Ubuntu Docker container), either by asking them human questions with definite answers or by giving them practical objectives to accomplish through a series of operations (e.g., "recursively set all files in the directory to read-only, except those of mine").
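As a rough illustration of what such an OS task involves, the sketch below wires a hypothetical `agent` callable to a running Ubuntu Docker container via `docker exec`. This is not the actual AgentBench harness, which ships its own environment servers and interaction protocol; the function names and the ("bash"/"answer") convention are assumptions made for this example.

```python
import subprocess

def run_in_container(container: str, command: str) -> str:
    """Run a bash command inside a running Docker container and return its output.
    (Illustrative helper only; AgentBench manages its containers differently.)"""
    result = subprocess.run(
        ["docker", "exec", container, "bash", "-c", command],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout + result.stderr

def os_task(agent, container: str, instruction: str, max_turns: int = 8) -> str:
    """Minimal OS-task loop: the agent proposes shell commands, observes their
    output, and eventually answers. `agent` is any callable mapping the running
    transcript to either ("bash", <command>) or ("answer", <final answer>)."""
    transcript = [f"Task: {instruction}"]
    for _ in range(max_turns):
        kind, content = agent("\n".join(transcript))
        if kind == "answer":
            return content
        transcript.append(f"$ {content}\n{run_in_container(container, content)}")
    return "Turn limit reached."
```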

Evaluation
The research team used AgentBench to evaluate the performance of 25 LLMs as agents in a practical, usage-oriented way.
AgentBench prompts each model to reason about the situation in natural language before committing to an action, a technique known as Chain-of-Thought (CoT) prompting. All environments are interactive and text-only, so LLMs can operate and learn as autonomous agents through language alone.
AgentBench also provides a unified testing framework and toolkit that makes it easy to evaluate LLMs.
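The snippet below sketches what a CoT-style agent prompt for the OS environment might look like. The wording and the "Think:/Act:" action format are hypothetical; the actual templates used by AgentBench are published in its open-source repository.

```python
# Hypothetical CoT-style instruction for an OS task, for illustration only.
SYSTEM_PROMPT = """You are an agent operating a bash terminal to solve a task.
On every turn, first write your reasoning after 'Think:', then exactly one
action after 'Act:'. Valid actions are:
  bash(<command>)   -- run a shell command and observe its output
  answer(<text>)    -- give the final answer and stop
"""

EXAMPLE_TURN = """Think: I need to know how many files are in /data before answering.
Act: bash(ls /data | wc -l)
"""
```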
The figure below, which is divided into two panels (a) and (b), shows an overview of the performance of LLMs on a number of AgentBench tasks. The LLMs (API-based or open-source) are ranked from top to bottom, with the best-performing LLM at the top.

The commercial models (like GPT-4) perform much better than the open-source ones. They can handle a wide array of real-world tasks, indicating their potential for building a potent, continuously learning agent.
Conclusion
AgentBench is a multi-dimensional tool that can test LLMs on many real-world challenges. Its 8 environments are interactive and realistic, unlike traditional benchmark datasets, which are static and fixed.
Learn more:
- Research paper: “AgentBench: Evaluating LLMs as Agents” (on arXiv)
- Project page (code, quick start, tutorial)