TRAIL: Trace Reasoning and Agentic Issue Localization

Darshan Deshpande; Varun Gangal; Hersh Mehta; Jitin Krishnan; Anand Kannappan; Rebecca Qian

arXiv:2505.08638 · cs.AI · June 25, 2025

TL;DR

This paper introduces TRAIL, a comprehensive dataset and taxonomy for evaluating and debugging complex agentic workflow traces generated by AI systems, highlighting current limitations of large language models in trace analysis.

Contribution

The paper provides a formal taxonomy of error types in agentic systems, a dataset of 148 large human-annotated traces, and an evaluation showing that current models are inadequate for trace debugging in agentic workflows.

Findings

01 LLMs perform poorly at trace debugging: the best model evaluated, Gemini-2.5-pro, scores only 11% on TRAIL.

02 The TRAIL dataset enables systematic evaluation of agentic trace analysis.

03 The taxonomy helps in understanding and categorizing errors in agentic systems.

Abstract

The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks.…

Code & Models

Datasets

PatronusAI/TRAIL (dataset, 346 downloads)
