TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu; Yufan Song; Boxuan Li; Yuxuan Tang; Kritanjali Jain; Mengxue Bao; Zora Z. Wang; Xuhui Zhou; Zhitong Guo; Murong Cao; Mingyang Yang; Hao Yang Lu; Amaad Martin; Zhe Su; Leander Maben; Raj Mehta; Wayne Chi; Lawrence Jang; Yiqing Xie; Shuyan Zhou; Graham Neubig

arXiv:2412.14161 · cs.CL · September 11, 2025 · 5 cites

TL;DR

This paper introduces TheAgentCompany, a benchmark environment for evaluating large language model agents on real-world work tasks, and finds that the most competitive agent autonomously completes about 30% of tasks in a simulated workplace setting.

Contribution

It presents a new extensible benchmark environment and evaluates LLM agents' performance on realistic workplace tasks, highlighting current capabilities and limitations.

Findings

01 The most competitive agent autonomously completes about 30% of tasks

02 Simpler tasks are more likely to be automated

03 Long-horizon, complex tasks remain challenging

Abstract

We interact with computers every day, whether in daily life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has been rapid development of AI agents that interact with and effect change in their surrounding environments. But how well do AI agents accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industries looking to adopt AI into their workflows and for economic policy seeking to understand the effects that AI adoption may have on the labor market. To measure the progress of these LLM agents on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with…

Code & Models

Datasets

ScaleAI/lhaw
dataset · 121 dl

Videos

OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29· youtube
