AgentBench: Evaluating LLMs as Agents

Xiao Liu; Hao Yu; Hanchen Zhang; Yifan Xu; Xuanyu Lei; Hanyu Lai; Yu Gu; Hangliang Ding; Kaiwen Men; Kejuan Yang; Shudan Zhang; Xiang Deng; Aohan Zeng; Zhengxiao Du; Chenhui Zhang; Sheng Shen; Tianjun Zhang; Yu Su; Huan Sun; Minlie Huang; Yuxiao Dong; Jie Tang

arXiv:2308.03688 · cs.AI · October 7, 2025 · 48 cites

TL;DR

AgentBench is a comprehensive benchmark for evaluating large language models as agents across diverse environments, revealing performance gaps and guiding improvements in reasoning, decision-making, and instruction following.

Contribution

The paper introduces AgentBench, a multi-dimensional benchmark with 8 environments for assessing LLMs as agents, and provides extensive evaluation results and insights into their strengths and weaknesses.

Findings

01. Top commercial LLMs excel in complex environments.
02. OSS LLMs underperform compared to commercial counterparts.
03. Poor long-term reasoning and instruction following are key challenges.

Abstract

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents.…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 3: reject, not good enough · Confidence 4

Strengths

- The paper proposes a comprehensive benchmark with evaluation results on a wide set of tasks

Weaknesses

- The contributions of the paper seem very limited -- the paper does not propose any new technical insights and simply applies a variety of LLMs on many existing environments.
- The analysis is not particularly insightful and I'm not sure if the conclusions are fully accurate from the analysis. For instance, task length exceeded may just be due to the fact that many LLMs are trained on short fixed context lengths. In setting such as this, it would be better to first summarize the long context t

Reviewer 02 · Rating 8: accept, good paper · Confidence 3

Strengths

- The paper is well-written and easy to follow.
- The benchmark covers diverse tasks and includes a well-designed HTTP evaluation interface. Overall it seems well thought through.
- The experiment results over 27 models could be very useful reference for LLM development

Weaknesses

- The benchmark seems to use the same prompt for all models, which might give an unfair advantage to the model where these prompts were developed for.
- There could be data leakage to the tasks selected from the pretraining data over the internet.

Reviewer 03 · Rating 6: marginally above the acceptance threshold · Confidence 4

Strengths

1. This is quite a unique type of benchmark, and could have profound implications for future LLM research or LLM as agent applications.
2. The benchmark captures a variety of tasks.

Weaknesses

1. The benchmark does not seem to offer any insights for improvement. (i.e. If my model is not doing well on web-browsing, what should I do?)
2. The embodied tasks seem quite contrived. AlfWorld drops all 2d/3d aspects of the environment and could be mastered by a fine-tuned GPT-2 [1].
3. The benchmark seems to be mostly coding based. Non-coding LLMs could potentially still behave as good agents, but would underperform on this benchmark.

Overall, I like the paper direction. All the below weak

Code & Models

Repositories

thudm/agentbench
none · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques