AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
Dongxin Guo, Jikun Wu, Siu Ming Yiu

TL;DR
AgentEval models agent executions as evaluation DAGs so that failures in intermediate steps can be detected and traced to their upstream root cause, rather than being masked by end-to-end outcome checks.
Contribution
This work presents a DAG-based, step-level evaluation method that combines a hierarchical failure taxonomy (3 levels, 21 subcategories) with automated root cause attribution through upstream dependency links.
Findings
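In an ablation, DAG-based dependency modeling alone improves failure detection recall by 22 percentage points and root cause accuracy by 34 percentage points over flat step-level evaluation with identical judges and rubrics.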
AgentEval achieves 2.17x higher failure detection recall than end-to-end evaluation.
In a pilot, AgentEval reduced root-cause identification time from 4.2 hours to 22 minutes.
Abstract
Agentic systems that chain reasoning, tool use, and synthesis into multi-step workflows are entering production, yet prevailing evaluation practices like end-to-end outcome checks and ad-hoc trace inspection systematically mask the intermediate failures that dominate real-world error budgets. We present AgentEval, a framework that formalizes agent executions as evaluation directed acyclic graphs (DAGs), where each node carries typed quality metrics assessed by a calibrated LLM judge (GPT-4o), classified through a hierarchical failure taxonomy (3 levels, 21 subcategories), and linked to upstream dependencies for automated root cause attribution. An ablation study isolates the impact of DAG-based dependency modeling: it alone contributes +22 percentage points to failure detection recall and +34 pp to root cause accuracy over flat step-level evaluation with identical judges and rubrics.…
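The abstract describes nodes that are linked to upstream dependencies for root cause attribution, but it does not spell out the attribution procedure. The sketch below shows one simple way such a lookup could work, under stated assumptions: the `StepNode` schema, the boolean `passed` verdict (standing in for the judge's typed metrics), and the `root_causes` heuristic are all hypothetical and are not taken from the paper. The heuristic walks upstream from a failed step and reports the failed steps that have no failed dependency of their own.

```python
from dataclasses import dataclass, field


@dataclass
class StepNode:
    """One step of an agent execution (hypothetical schema, not the paper's)."""
    name: str
    passed: bool  # stand-in for the verdict a judge would assign to this step
    upstream: list[str] = field(default_factory=list)  # steps this one depends on


def root_causes(nodes: dict[str, StepNode], failed: str) -> set[str]:
    """Return the failed steps at or above `failed` that have no failed upstream step."""
    roots: set[str] = set()
    seen: set[str] = set()
    stack = [failed]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        failed_parents = [p for p in nodes[name].upstream if not nodes[p].passed]
        if failed_parents:
            # The failure may have propagated from upstream, so keep walking.
            stack.extend(failed_parents)
        else:
            # Nothing upstream failed: treat this step as an origin of the error.
            roots.add(name)
    return roots


# Illustrative trace: a failed search step feeds a failed synthesis step.
trace = {
    "plan": StepNode("plan", passed=True),
    "search": StepNode("search", passed=False, upstream=["plan"]),
    "calc": StepNode("calc", passed=True, upstream=["plan"]),
    "synthesize": StepNode("synthesize", passed=False, upstream=["search", "calc"]),
}
print(root_causes(trace, "synthesize"))  # {'search'}
```

A flat step-level evaluation would flag both `search` and `synthesize` as independent failures; following the dependency edges is what lets the downstream failure be attributed to the earlier step.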
