AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Shicheng Fang; Yuxin Wang; Xiaoran Liu; Jiahao Lu; Chuanyuan Tan; Xinchi Chen; Yining Zheng; Xuanjing Huang; Xipeng Qiu

arXiv:2601.20730·cs.CL·February 2, 2026

TL;DR

AgentLongBench introduces a dynamic, environment-based benchmark for evaluating long-context agents, revealing their struggles with information synthesis in complex, knowledge-intensive scenarios.

Contribution

It presents an environment-rollout framework for long-context evaluation that addresses the limitations of existing static benchmarks and exposes new challenges for large language models.

Findings

01. State-of-the-art models excel at static retrieval but struggle with dynamic information synthesis.

02. Performance degradation correlates with the minimum token count needed to resolve queries.

03. High information density in tool responses significantly challenges agent capabilities.

Abstract

The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum…

Code & Models

Datasets

ign1s/AgentLongBench (dataset, 37 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Language and cultural evolution