MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, and Chongkang Tan

TL;DR
MetaRCA is a novel, scalable, and generalizable framework for root cause analysis in cloud-native systems, leveraging a reusable knowledge base and real-time data to improve accuracy and adaptability across diverse system topologies.
Contribution
The paper introduces MetaRCA, which constructs a Meta Causal Graph from multiple knowledge sources and uses it for efficient, accurate root cause analysis with strong cross-system generalization.
Findings
Outperforms baselines by 29-48 percentage points in accuracy.
Maintains over 80% accuracy across diverse systems.
Scales near-linearly with system complexity.
Abstract
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports,…
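As a rough illustration of what "fusing evidence into a metadata-level causal graph" could look like, the sketch below combines per-source confidence scores for each candidate edge and keeps the edges that clear a threshold. It is not the paper's algorithm: the abstract is truncated and does not specify the fusion rule. The noisy-OR combination, the threshold, the metadata names (such as `pod.cpu_usage`), and the scores are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical evidence records (all values are illustrative, not from the paper):
# (cause metadata, effect metadata, knowledge source, confidence in [0, 1]).
EVIDENCE = [
    ("pod.cpu_usage", "service.latency", "llm", 0.6),
    ("pod.cpu_usage", "service.latency", "fault_report", 0.5),
    ("node.disk_io", "pod.cpu_usage", "llm", 0.3),
]


def fuse_evidence(evidence):
    """Combine per-source confidences for each edge with a noisy-OR rule.

    Noisy-OR is an assumed stand-in for the paper's fusion step: an edge is
    believed unless every source independently fails to support it.
    """
    disbelief = defaultdict(lambda: 1.0)
    for cause, effect, _source, conf in evidence:
        disbelief[(cause, effect)] *= 1.0 - conf
    return {edge: 1.0 - d for edge, d in disbelief.items()}


def build_meta_graph(evidence, threshold=0.5):
    """Keep only edges whose fused confidence reaches the (assumed) threshold."""
    graph = defaultdict(dict)
    for (cause, effect), score in fuse_evidence(evidence).items():
        if score >= threshold:
            graph[cause][effect] = score
    return dict(graph)


meta_graph = build_meta_graph(EVIDENCE)
print(meta_graph)
```

Because the edges are keyed by metadata types rather than concrete service instances, a graph of this shape could in principle be reused across systems with different topologies, which is the property the abstract attributes to the MCG.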
Taxonomy
Topics: Software System Performance and Reliability · Bayesian Modeling and Causal Inference · Advanced Graph Neural Networks
