AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

Zhaohui Geoffrey Wang

arXiv:2603.14688 · cs.LG · March 30, 2026

TL;DR

AgentTrace is a lightweight causal tracing framework that reconstructs causal graphs from execution logs to identify, with high accuracy, the root causes of failures in deployed multi-agent systems.

Contribution

AgentTrace introduces a causal graph reconstruction method that requires no LLM inference at debugging time, enabling fast and accurate root cause analysis in real-world multi-agent deployments.

Findings

01 Localizes root causes with high accuracy

02 Operates with sub-second latency

03 Outperforms heuristic and LLM-based baselines

Abstract

As multi-agent AI systems are increasingly deployed in real-world settings - from automated customer support to DevOps remediation - failures become harder to diagnose due to cascading effects, hidden dependencies, and long execution traces. We present AgentTrace, a lightweight causal tracing framework for post-hoc failure diagnosis in deployed multi-agent workflows. AgentTrace reconstructs causal graphs from execution logs, traces backward from error manifestations, and ranks candidate root causes using interpretable structural and positional signals - without requiring LLM inference at debugging time. Across a diverse benchmark of multi-agent failure scenarios designed to reflect common deployment patterns, AgentTrace localizes root causes with high accuracy and sub-second latency, significantly outperforming both heuristic and LLM-based baselines. Our results suggest that causal…
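
The pipeline the abstract describes (reconstruct a causal graph from logs, trace backward from the error, rank candidates by structural and positional signals) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Event` log schema, the `anomalous` flag, the choice of fan-out as the structural signal and trace position as the positional signal, and the scoring weights.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One log record. Hypothetical schema, not the paper's log format."""
    event_id: str
    agent: str
    step: int                                    # position in the execution trace
    parents: list = field(default_factory=list)  # ids of events this one consumed
    anomalous: bool = False                      # e.g. a retry or malformed output


def build_graph(events):
    """Reconstruct parent and child adjacency from the 'parents' log field."""
    by_id = {e.event_id: e for e in events}
    parents = {e.event_id: [p for p in e.parents if p in by_id] for e in events}
    children = defaultdict(list)
    for eid, ps in parents.items():
        for p in ps:
            children[p].append(eid)
    return by_id, parents, children


def trace_back(error_id, parents):
    """Breadth-first walk over parent edges; returns the set of ancestor ids."""
    seen = {error_id}
    queue = deque([error_id])
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    seen.discard(error_id)
    return seen


def rank_root_causes(events, error_id):
    """Score each ancestor of the error; weights are illustrative assumptions."""
    by_id, parents, children = build_graph(events)
    max_step = max(e.step for e in events) or 1
    scored = []
    for eid in trace_back(error_id, parents):
        e = by_id[eid]
        structural = 0.5 * len(children[eid])    # fan-out: how many events depend on it
        positional = 1.0 - e.step / max_step     # earlier events score higher
        score = 2.0 * float(e.anomalous) + structural + positional
        scored.append((eid, score))
    # Highest score first; ties broken by event id for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy trace: a retriever emits a bad result that cascades into a failure at e5.
EVENTS = [
    Event("e1", "planner", 0),
    Event("e2", "retriever", 1, ["e1"], anomalous=True),
    Event("e3", "summarizer", 2, ["e2"]),
    Event("e4", "executor", 3, ["e2", "e3"]),
    Event("e5", "executor", 4, ["e4"]),
    Event("e6", "logger", 2, ["e1"]),            # not upstream of the failure
]

ranked = rank_root_causes(EVENTS, "e5")
```

In this toy trace only ancestors of `e5` are considered, so the unrelated `e6` never becomes a candidate, and no model call is needed at any step. A real deployment would have to derive the parent links from whatever correlation identifiers its logs actually carry.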
