Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism
Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Leyi Pan, Chiming Duan, Minghua He, Pei Xiao, Ying Li

TL;DR
This paper introduces SpecRCA, a root cause analysis (RCA) framework for microservices that uses a hypothesize-then-verify approach to localize the root causes of anomalies more accurately and efficiently.
Contribution
The paper presents a new speculative RCA framework that enhances exploration diversity and reduces inference time compared to existing LLM-based methods.
Findings
Achieves higher accuracy than existing methods on the AIOps 2022 dataset.
Demonstrates improved efficiency in root cause verification.
Shows potential for scalable and interpretable RCA in microservices.
Abstract
Microservice systems have become the backbone of cloud-native enterprise applications due to their resource elasticity, loosely coupled architecture, and lightweight deployment. Yet, the intrinsic complexity and dynamic runtime interactions of such systems inevitably give rise to anomalies. Ensuring system reliability therefore hinges on effective root cause analysis (RCA), which entails not only localizing the source of anomalies but also characterizing the underlying failures in a timely and interpretable manner. Recent advances in intelligent RCA techniques, particularly those powered by large language models (LLMs), have demonstrated promising capabilities, as LLMs reduce reliance on handcrafted features while offering cross-platform adaptability, task generalization, and flexibility. However, existing LLM-based methods still suffer from two critical limitations: (a) limited…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software-Defined Networks and 5G
