Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
Tianyi Yang, Jiacheng Shen, Yuxin Su, Xiaoxue Ren, Yongqiang Yang, Michael R. Lyu

TL;DR
This paper empirically studies anti-patterns in cloud alerts at Huawei Cloud, identifying common issues and mitigation strategies to improve alert quality and cloud system reliability.
Contribution
It provides the first empirical analysis of alert anti-patterns in an industrial cloud environment, including classification, mitigation, and future evaluation directions.
Findings
Identified four individual anti-patterns of alerts.
Discovered two collective anti-patterns affecting alert effectiveness.
Proposed guidelines and future directions for automatic alert quality evaluation.
Abstract
Alerts are crucial for requesting prompt human intervention upon cloud anomalies. The quality of alerts significantly affects the cloud reliability and the cloud provider's business revenue. In practice, we observe on-call engineers being hindered from quickly locating and fixing faulty cloud services because of the vast existence of misleading, non-informative, non-actionable alerts. We call the ineffectiveness of alerts "anti-patterns of alerts". To better understand the anti-patterns of alerts and provide actionable measures to mitigate anti-patterns, in this paper, we conduct the first empirical study on the practices of mitigating anti-patterns of alerts in an industrial cloud system. We study the alert strategies and the alert processing procedure at Huawei Cloud, a leading cloud provider. Our study combines the quantitative analysis of millions of alerts in two years and a survey…
Taxonomy
Topics: Cloud Computing and Resource Management · Big Data and Business Intelligence · Software System Performance and Reliability
