AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Shraddha Barke; Arnav Goyal; Alind Khare; Avaljot Singh; Suman Nath; Chetan Bansal

arXiv:2602.02475 · cs.AI · February 3, 2026

TL;DR

AgentRx introduces a novel, automated diagnostic framework that localizes failure points in AI agent trajectories, significantly improving failure attribution across diverse tasks and reducing human effort.

Contribution

This work provides a new benchmark of annotated failure trajectories and a domain-agnostic diagnostic tool that enhances failure localization and attribution in AI agents.

Findings

01. Improved accuracy in failure step localization.
02. Effective cross-domain failure attribution.
03. Benchmark dataset of 115 annotated failure trajectories.

Abstract

AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and releasing a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory-derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated, domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the…
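The pipeline the abstract describes (constraints checked step by step, yielding a validation log with evidence) can be illustrated with a small sketch. This is not the authors' implementation: every name here (`Step`, `Constraint`, `Violation`, `build_validation_log`, `earliest_violation`) is hypothetical, the two constraints are hand-written rather than synthesized, and the LLM-based judge is replaced by a plain "earliest violation" heuristic purely so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    index: int
    actor: str    # which agent acted (hypothetical field)
    action: str   # tool or action name (hypothetical field)
    output: str   # raw tool output


@dataclass
class Constraint:
    name: str
    # Returns an evidence string when violated at this step, otherwise None.
    check: Callable[[Step, List[Step]], Optional[str]]


@dataclass
class Violation:
    step_index: int
    constraint: str
    evidence: str


def build_validation_log(trajectory: List[Step],
                         constraints: List[Constraint]) -> List[Violation]:
    """Evaluate every constraint at every step, recording violations in order."""
    log: List[Violation] = []
    for i, step in enumerate(trajectory):
        history = trajectory[:i]  # steps strictly before the current one
        for c in constraints:
            evidence = c.check(step, history)
            if evidence is not None:
                log.append(Violation(step.index, c.name, evidence))
    return log


def earliest_violation(log: List[Violation]) -> Optional[Violation]:
    """Stand-in for the judge (assumption): pick the first violating step."""
    return min(log, key=lambda v: v.step_index, default=None)


# Two illustrative, hand-written constraints (assumptions, not from the paper).
constraints = [
    Constraint(
        "no_tool_error",
        lambda s, h: s.output if s.output.startswith("ERROR") else None,
    ),
    Constraint(
        "no_repeated_action",
        lambda s, h: f"repeats '{s.action}'"
        if h and h[-1].action == s.action else None,
    ),
]

# A made-up failed trajectory.
trajectory = [
    Step(0, "planner", "search_flights", "3 results"),
    Step(1, "executor", "book_flight", "ERROR: invalid date"),
    Step(2, "executor", "book_flight", "ERROR: invalid date"),
    Step(3, "executor", "send_confirmation", "sent"),
]

log = build_validation_log(trajectory, constraints)
for v in log:
    print(v.step_index, v.constraint, v.evidence)
print("candidate critical step:", earliest_violation(log).step_index)
```

Keeping the log as structured records with evidence, rather than a single verdict, is what makes the result auditable: a reviewer or a downstream judge can inspect why each step was flagged. The earliest-violation rule is only a crude placeholder; the abstract states that an LLM-based judge consumes the log instead.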

Code & Models

Datasets

microsoft/AgentRx
dataset · 45 downloads

Taxonomy

Topics: Software System Performance and Reliability · AI-based Problem Solving and Planning · Explainable Artificial Intelligence (XAI)