Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices
Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Leyi Pan, Chiming Duan, Minghua He, Mengxi Jia, Ying Li

TL;DR
This paper introduces AMER-RCL, a novel recursive reasoning framework with agentic memory that improves root cause localization accuracy and efficiency in complex microservice systems by mimicking expert reasoning patterns.
Contribution
It proposes a new multi-agent recursive reasoning approach with agentic memory, addressing limitations of existing methods in adaptability, accuracy, and latency.
Findings
Outperforms state-of-the-art methods in accuracy
Reduces inference latency significantly
Effectively reuses reasoning across alerts
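To make the cross-alert reuse finding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `ReasoningMemory` keyed by `(service, symptom)` and a hypothetical `localize` function that recursively follows unhealthy downstream dependencies. In an LLM-based system the recursive step would be an agent call; here a plain dependency walk stands in for it. All names, the dependency map, and the depth limit are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Alert:
    service: str   # service that raised the alert
    symptom: str   # e.g. "latency" (hypothetical label)


@dataclass
class ReasoningMemory:
    """Hypothetical store of past conclusions, keyed by (service, symptom)."""
    store: Dict[Tuple[str, str], str] = field(default_factory=dict)
    hits: int = 0

    def recall(self, alert: Alert) -> Optional[str]:
        key = (alert.service, alert.symptom)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        return None

    def remember(self, alert: Alert, root: str) -> None:
        self.store[(alert.service, alert.symptom)] = root


def localize(alert: Alert, deps: Dict[str, List[str]], unhealthy: Set[str],
             memory: ReasoningMemory, depth: int = 0, max_depth: int = 5) -> str:
    """Recursively follow unhealthy callees, reusing remembered conclusions."""
    cached = memory.recall(alert)
    if cached is not None:
        return cached  # reuse earlier reasoning instead of repeating it

    root = alert.service  # default: blame the alerting service itself
    if depth < max_depth:  # depth limit also guards against dependency cycles
        for callee in deps.get(alert.service, []):
            if callee in unhealthy:
                # Stand-in for a deeper reasoning step on the suspicious callee.
                root = localize(Alert(callee, alert.symptom), deps, unhealthy,
                                memory, depth + 1, max_depth)
                break

    memory.remember(alert, root)
    return root
```

With a shared `ReasoningMemory`, a second alert whose dependency chain overlaps an earlier one stops at the first remembered node rather than re-deriving the whole chain, which is the kind of saving the latency finding points to.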
Abstract
As contemporary microservice systems become increasingly popular and complex, often comprising hundreds or even thousands of fine-grained, interdependent subsystems, they are experiencing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While many traditional graph-based and deep learning approaches have been explored for this task, they often rely heavily on pre-defined schemas that struggle to adapt to evolving operational contexts. Consequently, a number of LLM-based methods have recently been proposed. However, these methods still face two major limitations: shallow, symptom-centric reasoning that undermines accuracy, and a lack of cross-alert reuse that leads to redundant reasoning and high latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) localize the root causes of failures, drawing…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software Reliability and Analysis Research
