CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms
Yingying Zhang, Zhengxiong Guan, Huajie Qian, Leili Xu, Hengbo Liu, Qingsong Wen, Liang Sun, Junwei Jiang, Lunting Fan, Min Ke

TL;DR
CloudRCA is a novel framework that leverages multi-source data and hierarchical Bayesian modeling to accurately and efficiently identify root causes of failures in complex cloud computing platforms, improving reliability and reducing troubleshooting time.
Contribution
The paper introduces CloudRCA, a comprehensive root cause analysis framework that integrates heterogeneous data sources with a hierarchical Bayesian network for superior accuracy and scalability in cloud environments.
Findings
Outperforms existing methods in F1-score across platforms
Handles new root cause types due to hierarchical structure
Reduces troubleshooting time by over 20% in practice
Abstract
As the business of Alibaba expands across the world among various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructure of Alibaba Cloud. However, root cause analysis in these platforms is non-trivial due to the complicated system architecture. In this paper, we propose a root cause analysis framework called CloudRCA, which makes use of heterogeneous multi-source data, including Key Performance Indicators (KPIs), logs, and topology, and extracts important features via state-of-the-art anomaly detection and log analysis techniques. The engineered features are then utilized in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to infer root causes with high accuracy and efficiency. An ablation study and comprehensive experimental comparisons demonstrate that, compared to existing…
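To make the inference step in the abstract concrete, here is a minimal sketch of ranking candidate root causes from binary anomaly features derived from KPIs and logs. It is not the paper's KHBN: it assumes a flat, naive-Bayes-style model in which features are conditionally independent given the root cause, and it ignores topology and the hierarchical structure entirely. The root cause names, feature names, and all probabilities are hypothetical values chosen for illustration.

```python
from math import prod

# Hypothetical prior over root cause types (illustrative values only).
PRIOR = {"disk_full": 0.3, "network_jitter": 0.5, "memory_leak": 0.2}

# Hypothetical P(feature is anomalous | root cause); not taken from the paper.
LIKELIHOOD = {
    "disk_full": {"kpi_disk_usage": 0.9, "kpi_latency": 0.4, "log_io_error": 0.8},
    "network_jitter": {"kpi_disk_usage": 0.1, "kpi_latency": 0.9, "log_io_error": 0.1},
    "memory_leak": {"kpi_disk_usage": 0.2, "kpi_latency": 0.6, "log_io_error": 0.1},
}


def posterior(observed):
    """Return P(root cause | observed features) under the naive independence assumption."""
    scores = {}
    for cause, prior in PRIOR.items():
        # Multiply the prior by the likelihood of each observed feature state.
        terms = [
            LIKELIHOOD[cause][feature] if anomalous else 1.0 - LIKELIHOOD[cause][feature]
            for feature, anomalous in observed.items()
        ]
        scores[cause] = prior * prod(terms)
    total = sum(scores.values())
    # Normalize so the posterior sums to 1.
    return {cause: score / total for cause, score in scores.items()}


def rank_root_causes(observed):
    """Sort candidate root causes from most to least probable."""
    return sorted(posterior(observed).items(), key=lambda item: item[1], reverse=True)


# Example: disk usage KPI and I/O error log pattern are anomalous, latency is normal.
observed = {"kpi_disk_usage": True, "kpi_latency": False, "log_io_error": True}
ranking = rank_root_causes(observed)
```

With these made-up numbers, the disk-related cause ranks first because both of its strongly associated features are anomalous. A Bayesian network as described in the abstract would replace the independence assumption with explicit dependencies between features and root causes.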
Taxonomy
Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications · Data Quality and Management
