Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems
Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenzhi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, Dan Pei

TL;DR
DejaVu is a machine learning-based fault localization method for online service systems that provides fast, actionable, and interpretable results for recurring failures, outperforming existing baselines.
Contribution
The paper introduces DejaVu, an automated, interpretable fault localization approach tailored for recurring failures in online services, combining historical data with global and local explanations.
Findings
DejaVu ranks ground truths at an average position of 1.66 to 5.03 among candidates.
DejaVu outperforms baseline methods by at least 51.51%.
The approach provides fault localization in less than one second.
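The average-position figure above is a rank-based metric. This summary does not give the paper's exact definition, so the following is only a minimal sketch of one common reading: for each failure, take the 1-based position of the first ground-truth item in the ranked candidate list, then average over failures. The function names, the candidate labels, and the handling of a missing ground truth are all hypothetical choices made for illustration.

```python
from statistics import mean


def rank_of_ground_truth(ranked_candidates, ground_truth):
    """Return the 1-based position of the first ground-truth item in the list."""
    for position, candidate in enumerate(ranked_candidates, start=1):
        if candidate in ground_truth:
            return position
    # Assumption: a ground truth absent from the list counts as ranked
    # one place past the end of the list.
    return len(ranked_candidates) + 1


def mean_rank(cases):
    """Average the ground-truth rank over (ranked_candidates, ground_truth) pairs."""
    return mean(rank_of_ground_truth(ranked, truth) for ranked, truth in cases)


# Hypothetical failures; the candidate labels are made up for this sketch.
cases = [
    (["db1:cpu", "svc2:latency", "os3:mem"], {"db1:cpu"}),
    (["svc2:latency", "db1:cpu", "os3:mem"], {"db1:cpu"}),
    (["os3:mem", "svc2:latency", "db1:cpu"], {"db1:cpu"}),
]

print(mean_rank(cases))
```

Under this reading a lower value is better, and a value of 1 would mean the true fault location is always ranked first.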
Abstract
Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across or within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures…
