AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li; Junhao Shi; Yang Xiao; Mohan Jiang; Jie Sun; Yunze Wu; Dayuan Fu; Shijie Xia; Xiaojie Cai; Tianze Xu; Weiye Si; Wenjie Li; Dequan Wang; Pengfei Liu

arXiv:2601.11044·cs.AI·April 24, 2026

TL;DR

AgencyBench is a comprehensive benchmark for evaluating autonomous agents across real-world scenarios, emphasizing long-horizon tasks, automated evaluation, and comparing open-source and proprietary models.

Contribution

Introduces AgencyBench, a large-scale benchmark derived from daily AI usage, with automated evaluation methods and analysis of diverse agentic capabilities.

Findings

01

Closed-source models outperform open-source models (48.4% vs 32.1%).

02

Models differ significantly in resource efficiency and in feedback-driven self-correction.

03

Proprietary models excel within their own ecosystems, while open-source models show distinct performance peaks.

Abstract

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities and can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability and fail to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker…
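
The abstract says each task comes with a query, deliverables, and rubrics. The sketch below is one way such a record could be represented and scored. It is an illustration only: the class names, field names, rubric weights, and the weighted-fraction scoring rule are all assumptions and are not taken from the AgencyBench repository or paper.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """One grading criterion (hypothetical structure, not the AgencyBench schema)."""
    description: str
    weight: float = 1.0


@dataclass
class Task:
    """A task with a query, expected deliverables, and rubrics (hypothetical)."""
    query: str
    deliverables: list[str]
    rubrics: list[Rubric] = field(default_factory=list)


def rubric_score(task: Task, passed: list[bool]) -> float:
    """Return the weighted fraction of rubrics satisfied, in [0, 1].

    The weighting rule is an assumption made for this sketch.
    """
    if len(passed) != len(task.rubrics):
        raise ValueError("need one pass/fail verdict per rubric")
    total = sum(r.weight for r in task.rubrics)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r, ok in zip(task.rubrics, passed) if ok)
    return earned / total


# Made-up example task; the query and rubrics are illustrative only.
example = Task(
    query="Build a command-line to-do app",
    deliverables=["todo.py"],
    rubrics=[
        Rubric("Adds items", weight=2.0),
        Rubric("Lists items", weight=1.0),
        Rubric("Persists items to disk", weight=1.0),
    ],
)
score = rubric_score(example, [True, True, False])
print(score)  # 0.75: 3.0 of 4.0 weighted points earned
```

Under this assumed scheme, a benchmark-level number such as the reported 48.4% could be an average of per-task scores, but the page does not state how the reported figures are aggregated.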

Code & Models

Repositories

GAIR-NLP/AgencyBench (GitHub)
