RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic
Andrea Tonon, Meng Zhang, Bora Caglayan, Fei Shen, Tong Gui, MingXue Wang, Rong Zhou

TL;DR
RADICE is a novel causal graph-based method for root cause analysis in system performance diagnostics, leveraging domain knowledge and causal discovery to improve accuracy over traditional correlation-based approaches.
Contribution
The paper introduces RADICE, a new algorithm that integrates causal domain knowledge with graph discovery techniques for more effective root cause analysis.
Findings
RADICE outperforms baseline methods in simulated data tests.
RADICE successfully identified root causes in a real-world case study.
The approach effectively incorporates partial domain knowledge into causal analysis.
Abstract
Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic. It aims to identify the root causes of system performance anomalies, allowing the resolution or the future prevention of issues that can cause millions of dollars in losses. Common existing approaches relying on data correlation or full domain expert knowledge are inaccurate or infeasible in most industrial cases, since correlation does not imply causation, and domain experts may not have full knowledge of complex and real-time systems. In this work, we define a novel causal domain knowledge model representing causal relations about the underlying system components to allow domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that through the causal graph discovery, enhancement, refinement, and…
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Risk and Safety Analysis
