TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

TL;DR
This paper introduces TheAgentCompany, a benchmark environment for evaluating large language model agents on real-world work tasks, revealing current systems can autonomously complete about 30% of tasks in a simulated workplace setting.
Contribution
It presents a new extensible benchmark environment and evaluates LLM agents' performance on realistic workplace tasks, highlighting current capabilities and limitations.
Findings
Most agents can autonomously complete 30% of tasks
Simpler tasks are more likely to be automated
Long-horizon, complex tasks remain challenging
Abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29· youtube
Taxonomy
TopicsBlockchain Technology Applications and Security
MethodsADaptive gradient method with the OPTimal convergence rate
