BALANCE: Bayesian Linear Attribution for Root Cause Localization
Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang, Wenhui Shi

TL;DR
BALANCE introduces a Bayesian attribution framework for root cause localization in distributed systems, leveraging explainable AI techniques to improve accuracy and efficiency in identifying system faults.
Contribution
BALANCE pioneers the use of explainable AI for RCA, combining Bayesian feature selection with attribution analysis to improve fault localization accuracy in real-world systems.
Findings
Outperforms state-of-the-art methods in accuracy and speed.
Achieves at least 6% higher accuracy on real-world tasks.
Successfully deployed in production for real-time fault diagnosis.
Abstract
Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting…
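To make the attribution view of RCA concrete, here is a minimal sketch of the general idea: fit a forward linear model from candidate root causes to a target KPI, then score each candidate by its coefficient times its deviation during the anomaly. This is not the authors' BMFS model. Plain ridge regression is swapped in for the Bayesian feature selection step, and the function name `linear_attribution`, the `ridge` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def linear_attribution(X, y, anomaly_mask, ridge=1e-3):
    """Score candidate root causes with a simple linear attribution.

    X: (T, d) array of candidate root-cause metrics over T time steps.
    y: (T,) array holding the target KPI.
    anomaly_mask: (T,) boolean array, True where the KPI is anomalous.
    NOTE: ridge regression is a stand-in, not the paper's BMFS model.
    """
    normal = ~anomaly_mask
    mu_x = X[normal].mean(axis=0)   # baseline of each candidate
    mu_y = y[normal].mean()         # baseline of the target KPI
    Xc, yc = X - mu_x, y - mu_y     # deviations from the normal baseline
    d = X.shape[1]
    # Forward model: predict the KPI deviation from candidate deviations.
    w = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(d), Xc.T @ yc)
    # Attribution: coefficient times mean deviation during the anomaly.
    delta = X[anomaly_mask].mean(axis=0) - mu_x
    return w * delta

# Synthetic example (hypothetical data): candidate 2 shifts during the anomaly.
rng = np.random.default_rng(0)
T, d = 200, 4
X = rng.normal(size=(T, d))
anomaly = np.zeros(T, dtype=bool)
anomaly[150:] = True
X[anomaly, 2] += 5.0
y = 2.0 * X[:, 2] + 0.5 * X[:, 0] + 0.1 * rng.normal(size=T)

scores = linear_attribution(X, y, anomaly)
ranking = np.argsort(-np.abs(scores))  # most suspicious candidate first
print(ranking, np.round(scores, 2))
```

In this toy setup the shifted candidate receives the largest attribution score because it both drives the KPI and deviates from its baseline. A ridge penalty does not handle strongly correlated candidates or produce sparse selections, which is the gap the abstract says BMFS is designed to address.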
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Cloud Computing and Resource Management
Methods: Feature Selection
