Holistic Evaluation and Failure Diagnosis of AI Agents

Netta Madvil; Gilad Dym; Alon Mecilati; Edo Dekel; Jonatan Liberman; Rotem Brazilay; Liron Schliesser; Max Svidlo; Shai Nir; Orel Shalom; Yaron Friedman; David Connack; Amos Rimon; Philip Tannor; and Shir Chorev

arXiv:2605.14865·cs.AI·May 15, 2026

TL;DR

This paper introduces a holistic evaluation framework for AI agents that pairs top-down agent-level diagnosis with bottom-up span-level evaluation. It localizes and categorizes failures within execution traces and reports state-of-the-art results on both GAIA and SWE-Bench in the TRAIL benchmark.

Contribution

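The framework decomposes trace analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length, produces a rationale for each span-level verdict, and improves failure localization and categorization over the strongest prior baselines.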

Findings

01

Up to 38% relative gain in category F1 over the strongest prior baselines on TRAIL.

02

Up to 3.5x higher localization accuracy than the strongest prior baselines.

03

Up to 12.5x higher joint localization-categorization accuracy than the strongest prior baselines.
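
To make the three headline numbers easier to compare, here is a minimal sketch of how localization accuracy, joint localization-categorization accuracy, and a relative gain could be computed for a single trace. The metric definitions are one plausible reading and are assumptions; the paper and the TRAIL benchmark may define them differently. The span IDs, category names, and all values are hypothetical toy data, not results from the paper.

```python
def localization_accuracy(gold, pred):
    """Share of gold failure spans that the evaluator also flagged (assumed definition)."""
    return sum(1 for span_id in gold if span_id in pred) / len(gold)


def joint_accuracy(gold, pred):
    """Share of gold failure spans flagged with the correct category (assumed definition)."""
    return sum(1 for span_id, cat in gold.items() if pred.get(span_id) == cat) / len(gold)


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, e.g. 0.38 means 38%."""
    return (new - baseline) / baseline


# Hypothetical annotations: span ID -> failure category.
gold = {"s2": "tool_error", "s5": "hallucination", "s7": "timeout", "s9": "formatting"}
pred = {"s2": "tool_error", "s5": "formatting", "s8": "timeout"}

loc = localization_accuracy(gold, pred)   # s2 and s5 are located
joint = joint_accuracy(gold, pred)        # only s2 has the right category too
print(loc, joint)  # 0.5 0.25
```

Under these assumed definitions, joint accuracy can never exceed localization accuracy, because a span must be located before its category can count as correct. Note also that a percentage such as 38% is a relative gain (`relative_gain`), while a multiplier such as 3.5x is the plain ratio of the new score to the baseline.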

Abstract

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis…
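
The bottom-up half of the approach, independent per-span assessment, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Span` and `Verdict` types, the span kinds, the category names, and the rule-based `toy_judge` are all hypothetical stand-ins (a real evaluator would more likely be an LLM-based judge), and the top-down agent-level diagnosis is not covered.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    span_id: str
    kind: str      # hypothetical labels such as "llm_call" or "tool_call"
    content: str


@dataclass
class Verdict:
    span_id: str
    category: Optional[str]  # None means no failure was found in this span
    rationale: str           # span-level explanation for the verdict


def toy_judge(span: Span) -> Verdict:
    """Rule-based stand-in for a real per-span evaluator (assumption)."""
    if span.kind == "tool_call" and "Traceback" in span.content:
        return Verdict(span.span_id, "tool_error", "tool output contains a traceback")
    if span.kind == "llm_call" and not span.content.strip():
        return Verdict(span.span_id, "empty_output", "model returned no text")
    return Verdict(span.span_id, None, "no issue detected")


def evaluate_trace(spans: List[Span],
                   judge: Callable[[Span], Verdict] = toy_judge) -> List[Verdict]:
    """Judge each span on its own, then keep only the failing verdicts."""
    verdicts = [judge(span) for span in spans]
    return [v for v in verdicts if v.category is not None]


trace = [
    Span("s1", "llm_call", "Plan: search the repository for the failing test."),
    Span("s2", "tool_call", "Traceback (most recent call last): ..."),
    Span("s3", "llm_call", ""),
]
for failure in evaluate_trace(trace):
    print(failure.span_id, failure.category, "-", failure.rationale)
```

Because each call to the judge sees a single span, the work grows linearly with trace length and could be parallelized, which is one way to read the abstract's claim that the decomposition scales to traces of arbitrary length. Each returned verdict carries both a location (`span_id`) and a failure type (`category`), tying the two together.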
