AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han

TL;DR
AgentCE-Bench introduces a scalable, controllable, and lightweight benchmark for evaluating agent reasoning across diverse models and domains, addressing existing limitations in environment overhead and task imbalance.
Contribution
It proposes a unified grid-based planning benchmark with adjustable horizons and difficulty, enabling fast, reproducible, and interpretable agent evaluation.
Findings
The number of hidden slots (H) and the decoy budget (B) reliably control task horizon and difficulty, respectively.
AgentCE-Bench shows strong domain consistency and model discriminability.
Performance varies significantly across models and domains.
Abstract
Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose AgentCE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots H, and Controllable Difficulty, governed by a decoy budget B that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling…
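To make the task structure concrete, the following is a minimal sketch of how such an instance might be built. It is not the authors' implementation. The function names (`build_task`, `call_tool`, `is_valid`), the slot fields (`name`, `kind`, `cost`), and the choice of a total-cost budget as the global constraint are all assumptions made for illustration. The sketch hides H slots, spreads B decoys across them, and serves every candidate lookup from a static JSON payload.

```python
import json
import random


def build_task(full_schedule, H, B, seed=0):
    """Hide H slots and spread B decoy candidates across them (hypothetical layout).

    Requires H >= 1 whenever B > 0.
    """
    rng = random.Random(seed)
    hidden = sorted(rng.sample(range(len(full_schedule)), H))

    # Every hidden slot starts with its one correct candidate.
    candidates = {idx: [dict(full_schedule[idx])] for idx in hidden}

    # Decoys match the slot's kind (local constraint holds) but cost more,
    # so picking one breaks the global budget (assumed global constraint).
    for k in range(B):
        idx = hidden[k % H]
        truth = full_schedule[idx]
        candidates[idx].append(
            {"name": f"decoy_{idx}_{k}", "kind": truth["kind"], "cost": truth["cost"] + 100}
        )
    for idx in hidden:
        rng.shuffle(candidates[idx])

    return {
        "visible": [None if i in hidden else s for i, s in enumerate(full_schedule)],
        "hidden": hidden,
        "budget": sum(s["cost"] for s in full_schedule),
        # Static JSON payloads stand in for the benchmark's static JSON files.
        "tools": {str(i): json.dumps(c) for i, c in candidates.items()},
    }


def call_tool(task, slot_idx):
    """Resolve a tool call by decoding a static JSON payload; no live environment."""
    return json.loads(task["tools"][str(slot_idx)])


def is_valid(task, filled):
    """Check the assumed global constraint: total cost must stay within budget."""
    visible_cost = sum(s["cost"] for s in task["visible"] if s is not None)
    filled_cost = sum(c["cost"] for c in filled.values())
    return visible_cost + filled_cost <= task["budget"]


schedule = [
    {"name": f"act{i}", "kind": "meal" if i % 2 else "visit", "cost": 10 * (i + 1)}
    for i in range(6)
]
task = build_task(schedule, H=3, B=4)
```

In this sketch, raising H lengthens the sequence of slots an agent must resolve, while raising B adds candidates that look locally correct but fail the global check, which mirrors the two control axes described in the abstract.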
