Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation
Lingzhe Zhang, Tong Jia, Mingyu Wang, Weijie Hong, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li

TL;DR
This paper introduces EAGER, a framework that uses reasoning trace representation and historical failure patterns to improve real-time failure detection and diagnosis in multi-agent systems, with the aim of making them more reliable.
Contribution
The paper proposes EAGER, a failure management framework that uses reasoning trace representation and contrastive learning to incorporate historical failure data for more accurate diagnosis.
Findings
EAGER effectively detects failures in real time within multi-agent systems.
Incorporating historical failure patterns improves diagnostic accuracy.
Preliminary evaluations show promising results on open-source multi-agent systems (MASs).
Abstract
Large Language Model (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per-trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning-scoped…
Taxonomy
Topics: Software System Performance and Reliability · Advanced Software Engineering Methodologies · Software Engineering Techniques and Practices
