Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems
Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, Qing Wang

TL;DR
This paper introduces TraceElephant, a benchmark for failure attribution in LLM-based multi-agent systems that provides full execution traces (agent inputs and context as well as outputs), and reports attribution accuracy gains of up to 76% when full traces are available.
Contribution
The paper presents TraceElephant, a novel benchmark that enables failure attribution with complete execution traces, aligning evaluation practices with real-world debugging scenarios.
Findings
Full traces increase attribution accuracy by up to 76%.
Missing inputs significantly obscure failure causes.
TraceElephant facilitates more realistic evaluation of attribution techniques.
Abstract
Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real-world developer-facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then…
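To make the task definition concrete, here is a minimal Python sketch of what a trace record and an attribution score could look like. It is not TraceElephant's actual data format or evaluation code: the names `Step`, `Attribution`, `to_partial`, and `score` are hypothetical, and the two metrics (agent-level and step-level accuracy) are assumed from the abstract's phrase "responsible agent and decisive step".

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    """One agent turn in an execution trace (hypothetical schema)."""
    index: int
    agent: str
    output: str
    input: Optional[str] = None  # only available under full observability


@dataclass(frozen=True)
class Attribution:
    """A failure attribution: the responsible agent and the decisive step."""
    agent: str
    step: int


def to_partial(trace: List[Step]) -> List[Step]:
    """Simulate an output-only trace by dropping each step's input."""
    return [replace(s, input=None) for s in trace]


def score(predictions: List[Attribution], labels: List[Attribution]) -> Dict[str, float]:
    """Fraction of cases where the predicted agent / step matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    n = len(labels)
    if n == 0:
        return {"agent_acc": 0.0, "step_acc": 0.0}
    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p.step == g.step for p, g in zip(predictions, labels))
    return {"agent_acc": agent_hits / n, "step_acc": step_hits / n}
```

Under these assumptions, an attribution method would be run once on the full trace and once on `to_partial(trace)`, and the two `score` results compared to see how much the missing inputs cost.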
