Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems

Zeyan Li; Nengwen Zhao; Mingjie Li; Xianglin Lu; Lixin Wang; Dongdong Chang; Xiaohui Nie; Li Cao; Wenzhi Zhang; Kaixin Sui; Yanhua Wang; Xu Du; Guoqiang Duan; Dan Pei

arXiv:2207.09021 · cs.SE · September 7, 2022

TL;DR

DejaVu is a machine learning-based fault localization method for online service systems that provides fast, actionable, and interpretable results for recurring failures, outperforming existing baselines.

Contribution

The paper introduces DejaVu, an automated, interpretable fault localization approach tailored for recurring failures in online services, combining historical data with global and local explanations.

Findings

1. DejaVu ranks ground truths at an average position of 1.66 to 5.03 among candidates.

2. DejaVu outperforms baseline methods by at least 51.51%.

3. The approach provides fault localization in less than one second.
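
To make the "average position" figure concrete, here is a minimal sketch of how a mean-rank metric can be computed. It assumes that "average position" means the mean 1-based rank of the first ground-truth component in each ranked candidate list; the function names and the component names in the sample data are hypothetical and are not taken from the paper or its repositories.

```python
from statistics import mean


def rank_of_ground_truth(ranked_candidates, ground_truths):
    """Return the 1-based position of the first ground-truth candidate."""
    for position, candidate in enumerate(ranked_candidates, start=1):
        if candidate in ground_truths:
            return position
    raise ValueError("no ground-truth candidate appears in the ranked list")


def mean_rank(cases):
    """Average the ground-truth rank over (ranked_candidates, ground_truths) pairs."""
    return mean(rank_of_ground_truth(ranked, truths) for ranked, truths in cases)


# Hypothetical failure cases: each pairs a ranked candidate list with its true faulty components.
cases = [
    (["db-1", "svc-a", "svc-b"], {"db-1"}),    # ground truth ranked 1st
    (["svc-a", "db-2", "svc-c"], {"db-2"}),    # ground truth ranked 2nd
    (["svc-b", "svc-c", "svc-a"], {"svc-a"}),  # ground truth ranked 3rd
]

print(mean_rank(cases))
```

Lower is better under this metric: a value of 1 would mean the true faulty component is always the top suggestion.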

Abstract

Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across or within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures…
