Multi-Agent Systems for Root Cause Analysis in Microservices
Alexander Naakka, Yuqing Wang, Mika V Mäntylä

TL;DR
This paper introduces LATS-RCA, a multi-agent framework using large language models for root cause analysis in microservices, employing a tree search guided by reflection scores to improve diagnostic accuracy.
Contribution
The paper presents a novel multi-agent LLM-based approach for RCA that formulates diagnosis as a reflection-guided tree search, enhancing accuracy and applicability in complex microservice systems.
Findings
LATS-RCA achieves high diagnostic accuracy on Light-OAuth2.
It demonstrates practical applicability in real-world production environments.
Benchmarking quantifies the computational cost of the approach.
Abstract
Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet, prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states are used to guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS, Light-OAuth2…
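To make the idea of a reflection-guided tree search concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each tree node holds one diagnostic hypothesis, that a `reflect` function returns a score in [0, 1] for a hypothesis, and that selection uses a UCT-style rule as in Monte Carlo tree search. The service names, the hard-coded scores, and the functions `expand` and `reflect` are hypothetical stand-ins for the LLM-driven agents that would reason over logs and metrics.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One intermediate diagnostic state (a root cause hypothesis)."""
    hypothesis: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # running mean of reflection scores in this subtree


def uct(node: Node, c: float = 1.0) -> float:
    """UCT-style priority (an assumption); unvisited nodes come first."""
    if node.visits == 0:
        return math.inf
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def search(
    root_hypothesis: str,
    expand: Callable[[str], List[str]],
    reflect: Callable[[str], float],
    iterations: int = 10,
) -> str:
    """Return the hypothesis with the highest reflection score seen."""
    root = Node(root_hypothesis)
    best_score, best_hypothesis = -math.inf, root_hypothesis
    for _ in range(iterations):
        # Selection: follow the highest-priority child down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: a leaf that was already scored gets refined hypotheses.
        if node.visits > 0:
            node.children = [Node(h, parent=node) for h in expand(node.hypothesis)]
            if node.children:
                node = node.children[0]
        # Reflection: score the current diagnostic state.
        score = reflect(node.hypothesis)
        if score > best_score:
            best_score, best_hypothesis = score, node.hypothesis
        # Backpropagation: update running means up to the root.
        while node is not None:
            node.visits += 1
            node.value += (score - node.value) / node.visits
            node = node.parent
    return best_hypothesis


# Hypothetical toy data standing in for agent output (not from the paper).
REFINEMENTS = {
    "system": ["svc-a", "svc-b"],
    "svc-b": ["svc-b: db timeout", "svc-b: slow cache"],
}
SCORES = {
    "system": 0.1,
    "svc-a": 0.2,
    "svc-b": 0.6,
    "svc-b: db timeout": 0.9,
    "svc-b: slow cache": 0.3,
}

result = search(
    "system",
    expand=lambda h: REFINEMENTS.get(h, []),
    reflect=lambda h: SCORES[h],
)
print(result)
```

In this sketch the search is not tied to a single diagnostic path: a low-scoring branch such as `svc-a` stays in the tree and can be revisited, while higher reflection scores pull later iterations toward the more promising branch.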
