AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

TL;DR
AgentBench is a comprehensive benchmark for evaluating large language models as agents across diverse environments, revealing performance gaps and guiding improvements in reasoning, decision-making, and instruction following.
Contribution
The paper introduces AgentBench, a multi-dimensional benchmark with 8 environments for assessing LLMs as agents, and provides extensive evaluation results and insights into their strengths and weaknesses.
Findings
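Top commercial LLMs show a strong ability to act as agents in complex environments.
Many open-source (OSS) LLMs no larger than 70B lag significantly behind their commercial counterparts.
Poor long-term reasoning, decision-making, and instruction following are the main obstacles to usable LLM agents.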
Abstract
The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents.…
Peer Reviews
Decision: ICLR 2024 poster
- The paper proposes a comprehensive benchmark with evaluation results on a wide set of tasks.
- The contributions of the paper seem very limited: the paper does not propose any new technical insights and simply applies a variety of LLMs to many existing environments.
- The analysis is not particularly insightful, and I'm not sure the conclusions are fully accurate given the analysis. For instance, task length exceeded may just be due to the fact that many LLMs are trained on short fixed context lengths. In a setting such as this, it would be better to first summarize the long context t
- The paper is well-written and easy to follow.
- The benchmark covers diverse tasks and includes a well-designed HTTP evaluation interface. Overall it seems well thought through.
- The experiment results over 27 models could be a very useful reference for LLM development.
- The benchmark seems to use the same prompt for all models, which might give an unfair advantage to the model for which these prompts were developed.
- There could be data leakage, since the selected tasks may appear in pretraining data from the internet.
1. This is quite a unique type of benchmark, and could have profound implications for future LLM research or LLM-as-agent applications.
2. The benchmark captures a variety of tasks.
1. The benchmark does not seem to offer any insights for improvement (e.g., if my model is not doing well on web browsing, what should I do?).
2. The embodied tasks seem quite contrived. ALFWorld drops all 2D/3D aspects of the environment and could be mastered by a fine-tuned GPT-2 [1].
3. The benchmark seems to be mostly coding-based. Non-coding LLMs could potentially still behave as good agents, but would underperform on this benchmark.
Overall, I like the paper direction. All the below weak
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
