TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian

TL;DR
This paper introduces TRAIL, a comprehensive dataset and taxonomy for evaluating and debugging complex agentic workflow traces generated by AI systems, highlighting current limitations of large language models in trace analysis.
Contribution
The paper provides a formal taxonomy of error types in agentic systems and a set of 148 large human-annotated traces, and it shows that current models are inadequate for trace debugging in agentic workflows.
Findings
LLMs perform poorly at trace debugging: Gemini-2.5-pro scores only 11% on TRAIL.
The TRAIL dataset enables systematic evaluation of agentic trace analysis.
The taxonomy aids in understanding and categorizing errors in agentic systems.
Abstract
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks.…
