AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

TL;DR
AgentEscapeBench is an escape-room-style benchmark that evaluates whether LLM agents can sustain long-range, tool-grounded reasoning on tasks with explicit dependency chains and real external function calls.
Contribution
The paper introduces an escape-room-style benchmark of 270 instances across five difficulty tiers, with fully automated evaluation, to measure LLM agents' performance on out-of-domain, long-range reasoning with external tools.
Findings
Performance drops significantly with increased dependency depth for both humans and models.
Humans maintain higher success rates than models at all difficulty levels.
Failures are mainly due to breakdowns in long-range state tracking and intermediate-result propagation.
Abstract
As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from…
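The task structure described in the abstract (a directed acyclic dependency graph over tools, intermediate results passed forward, and a deterministically checkable final answer) can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from the benchmark: the tool names, the toy puzzle, and the helpers `dependency_depth`, `run_task`, and `is_solved` are assumptions made only for illustration, and depth is measured as the number of tools on the longest dependency chain, which may differ from the paper's definition.

```python
from graphlib import TopologicalSorter

# Hypothetical toy task: each tool maps to the set of tools whose outputs it needs.
DEPENDENCIES = {
    "read_note": set(),
    "decode_cipher": {"read_note"},
    "open_drawer": {"decode_cipher"},
    "inspect_key": {"open_drawer"},
    "unlock_door": {"inspect_key", "read_note"},
}

# Hypothetical tool implementations; each receives the results of its prerequisites.
TOOLS = {
    "read_note": lambda inputs: "khoor",
    # Caesar shift of 3 applied in reverse to lowercase letters.
    "decode_cipher": lambda inputs: "".join(
        chr((ord(c) - 97 - 3) % 26 + 97) for c in inputs["read_note"]
    ),
    "open_drawer": lambda inputs: len(inputs["decode_cipher"]),
    "inspect_key": lambda inputs: inputs["open_drawer"] * 7,
    "unlock_door": lambda inputs: f"{inputs['read_note']}-{inputs['inspect_key']}",
}


def dependency_depth(deps):
    """Number of tools on the longest dependency chain (assumed depth measure)."""
    depth = {}
    # static_order() yields every node after all of its prerequisites.
    for node in TopologicalSorter(deps).static_order():
        depth[node] = 1 + max((depth[p] for p in deps[node]), default=0)
    return max(depth.values(), default=0)


def run_task(deps, tools):
    """Call each tool once its prerequisites have produced results."""
    results = {}
    for node in TopologicalSorter(deps).static_order():
        inputs = {p: results[p] for p in deps[node]}  # propagate intermediate results
        results[node] = tools[node](inputs)
    return results


def is_solved(results, final_tool, expected):
    """Deterministic check: the final answer must match exactly."""
    return results.get(final_tool) == expected


if __name__ == "__main__":
    results = run_task(DEPENDENCIES, TOOLS)
    print(dependency_depth(DEPENDENCIES), results["unlock_door"])
```

In this toy, `unlock_door` depends on both the first tool's raw output and a value derived four steps later, so an agent that loses track of either intermediate result cannot produce the final answer. An actual agent would also have to discover the ordering itself rather than receive it from a topological sort.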
