AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems
Zhaohui Geoffrey Wang

TL;DR
AgentTrace is a lightweight causal tracing framework that reconstructs causal graphs from logs to accurately identify root causes of failures in deployed multi-agent systems, improving reliability.
Contribution
It introduces a novel causal graph reconstruction method that does not rely on LLM inference, enabling fast and accurate root cause analysis in real-world multi-agent deployments.
Findings
Localizes root causes with high accuracy
Operates with sub-second latency
Outperforms heuristic and LLM-based baselines
Abstract
As multi-agent AI systems are increasingly deployed in real-world settings - from automated customer support to DevOps remediation - failures become harder to diagnose due to cascading effects, hidden dependencies, and long execution traces. We present AgentTrace, a lightweight causal tracing framework for post-hoc failure diagnosis in deployed multi-agent workflows. AgentTrace reconstructs causal graphs from execution logs, traces backward from error manifestations, and ranks candidate root causes using interpretable structural and positional signals - without requiring LLM inference at debugging time. Across a diverse benchmark of multi-agent failure scenarios designed to reflect common deployment patterns, AgentTrace localizes root causes with high accuracy and sub-second latency, significantly outperforming both heuristic and LLM-based baselines. Our results suggest that causal…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
