The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments
Logan Ritchie, Sushant Mehta, Nick Heiner, Mason Yu, Edwin Chen

TL;DR
This paper empirically evaluates leading AI models on 150 realistic workplace tasks, revealing a hierarchy of necessary agentic skills and highlighting significant performance gaps in complex, multi-step environments.
Contribution
It introduces a hierarchy of agentic capabilities for AI models and a task-centric RL environment design methodology, providing insights into current model limitations and development directions.
Findings
Even the best-performing models fail about 40% of tasks, and failures cluster along the capability hierarchy.
Weaker models struggle with tool use and planning; stronger models falter on contextual inference.
The paper proposes a diverse, expert-informed RL environment design for better evaluation.
Abstract
The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge. Our analysis reveals an empirically-derived hierarchy of agentic capabilities that models must master for real-world deployment: (1) tool use, (2) planning and goal formation, (3) adaptability, (4) groundedness, and (5) common-sense reasoning. Even the best-performing models fail approximately 40% of the tasks, with failures clustering predictably along this hierarchy. Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions. We introduce a…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Natural Language Processing Techniques
