AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
Parsa Mazaheri, Kasra Mazaheri

TL;DR
AgentAtlas is an evaluation framework for large language model agents. It pairs two taxonomies (control decisions and trajectory failures) with a methodology that scores agents along several behavioral axes instead of a single outcome metric.
Contribution
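AgentAtlas extends prior agent-evaluation work with a six-state control-decision taxonomy, a nine-category trajectory-failure taxonomy, and a methodology for measuring where a model's apparent capability comes from and how much agent behavior existing benchmarks cover.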
Findings
Removing explicit label menus reduces trajectory accuracy by 14-40 percentage points.
No single model excels across control accuracy, diagnosis, and utility retention.
Comparing taxonomy-aware and taxonomy-blind settings shows how much of a model's measured performance depends on supervision supplied during evaluation rather than on the model alone.
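The first finding is a difference between two accuracies, reported in percentage points. The following is a minimal sketch of that arithmetic, assuming predictions and gold labels are plain strings compared by exact match. The function names `accuracy` and `menu_gap_pp` are hypothetical, and the toy labels are made up for illustration; none of this is the paper's own scoring code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def menu_gap_pp(
    aware_predictions: Sequence[str],
    blind_predictions: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy lost, in percentage points, when the label menu is removed.

    A positive value means the model scored higher with the menu
    (taxonomy-aware) than without it (taxonomy-blind).
    """
    aware = accuracy(aware_predictions, gold)
    blind = accuracy(blind_predictions, gold)
    return 100.0 * (aware - blind)


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    gold = ["Act", "Ask", "Refuse", "Stop", "Act"]
    with_menu = ["Act", "Ask", "Refuse", "Stop", "Ask"]
    without_menu = ["Act", "Ask", "Refuse", "Act", "Ask"]
    print(f"gap: {menu_gap_pp(with_menu, without_menu, gold):.1f} pp")
```

Reporting the gap in percentage points rather than as a relative change keeps the number comparable across models whose taxonomy-aware accuracies differ.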
Abstract
Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the…
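The label structure described in the abstract can be sketched as a small data model. This is an illustration only: the six control states and the field names `primary_error_source` and `impact` come from the abstract, while the class names, the `category` field, and the `parse_decision` helper are hypothetical. The nine failure categories are not named in this excerpt, so they are left as free-form strings rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum


class ControlDecision(Enum):
    """The six control states listed in the abstract."""

    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class FailureLabel:
    """One trajectory-failure annotation (hypothetical container).

    category: one of the nine failure categories; names are not given
        in this excerpt, so no fixed vocabulary is enforced here.
    primary_error_source, impact: the two orthogonal labels named in
        the abstract; their allowed values are likewise not specified.
    """

    category: str
    primary_error_source: str
    impact: str


def parse_decision(text: str) -> ControlDecision:
    """Map a model's free-text answer to a control state.

    Assumes the answer is a single word; raises ValueError otherwise.
    """
    normalized = text.strip().capitalize()
    return ControlDecision(normalized)


if __name__ == "__main__":
    print(parse_decision("  refuse "))
    print([state.value for state in ControlDecision])
```

Using an enum makes an out-of-vocabulary answer fail loudly instead of being silently counted as a seventh state.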
