The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments

Logan Ritchie; Sushant Mehta; Nick Heiner; Mason Yu; Edwin Chen

arXiv:2601.09032·cs.AI·January 15, 2026

TL;DR

This paper empirically evaluates leading AI models on 150 realistic workplace tasks, revealing a hierarchy of necessary agentic skills and highlighting significant performance gaps in complex, multi-step environments.

Contribution

It introduces a hierarchy of agentic capabilities for AI models and a task-centric RL environment design methodology, providing insights into current model limitations and development directions.

Findings

01

Even the best-performing models fail about 40% of tasks, with failures clustering along the capability hierarchy.

02

Weaker models struggle with tool use and planning; stronger models falter on contextual inference.

03

The paper proposes a diverse, expert-informed approach to RL environment design for better evaluation.

Abstract

The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge. Our analysis reveals an empirically derived hierarchy of agentic capabilities that models must master for real-world deployment: (1) tool use, (2) planning and goal formation, (3) adaptability, (4) groundedness, and (5) common-sense reasoning. Even the best-performing models fail approximately 40% of the tasks, with failures clustering predictably along this hierarchy. Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions. We introduce a…
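
To make the idea of failures "clustering along the hierarchy" concrete, the following is a minimal sketch of how one might tally failed tasks by capability level. The five level names come from the abstract; the `tally_failures` helper, the per-task labelling scheme, and the example labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from enum import IntEnum


class Capability(IntEnum):
    """The five levels named in the abstract, in the order listed there."""
    TOOL_USE = 1
    PLANNING_AND_GOAL_FORMATION = 2
    ADAPTABILITY = 3
    GROUNDEDNESS = 4
    COMMON_SENSE_REASONING = 5


def tally_failures(failure_labels):
    """Count failed tasks per capability level, lowest level first.

    `failure_labels` is a hypothetical list holding one Capability per
    failed task, e.g. assigned by an annotator reading the trajectory.
    """
    counts = Counter(failure_labels)
    return {level: counts.get(level, 0) for level in Capability}


# Made-up labels for illustration only; these are not results from the paper.
example_labels = [
    Capability.TOOL_USE,
    Capability.COMMON_SENSE_REASONING,
    Capability.COMMON_SENSE_REASONING,
    Capability.GROUNDEDNESS,
]
profile = tally_failures(example_labels)
for level, n in profile.items():
    print(f"{level.name}: {n}")
```

Under this labelling assumption, a weaker model's profile would be expected to concentrate in the first two levels and a stronger model's in the later ones, which is the pattern the abstract describes.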

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Natural Language Processing Techniques