Graph-based Incident Aggregation for Large-Scale Online Service Systems
Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xuemin Wen, Xiao Ling, Yongqiang Yang, Michael R. Lyu

TL;DR
This paper introduces GRLIA, a graph representation learning framework that efficiently aggregates incidents in large-scale online service systems by capturing topological and temporal correlations, improving incident management accuracy.
Contribution
The paper presents a graph-based, unsupervised approach to incident aggregation that uses system monitoring data to improve aggregation accuracy and has been deployed in Huawei Cloud.
Findings
GRLIA outperforms existing incident aggregation methods.
It effectively captures cascading failure correlations.
The framework is successfully deployed in Huawei Cloud.
Abstract
As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents. Thus, it can be easily employed for online incident aggregation. In particular, to learn the correlations…
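The online aggregation step the abstract alludes to can be illustrated with a minimal sketch. It assumes each incident type already has a learned representation vector, and it groups incidents whose vectors are similar and whose timestamps fall close together. The names `aggregate`, `embeddings`, `threshold`, and `window`, the sample vectors, and the use of cosine similarity with a fixed time window are all assumptions made for illustration; the truncated abstract does not specify how GRLIA performs this step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate(incidents, embeddings, threshold=0.8, window=600):
    """Group incidents with similar type vectors and nearby timestamps.

    incidents: list of (incident_id, incident_type, timestamp_seconds)
    embeddings: dict mapping incident_type to its vector (hypothetical values)
    threshold, window: illustrative cut-offs, not values from the paper
    """
    parent = list(range(len(incidents)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(incidents)):
        for j in range(i + 1, len(incidents)):
            _, type_i, t_i = incidents[i]
            _, type_j, t_j = incidents[j]
            if abs(t_i - t_j) > window:
                continue  # too far apart in time to be linked directly
            if cosine(embeddings[type_i], embeddings[type_j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for idx, (incident_id, _, _) in enumerate(incidents):
        groups.setdefault(find(idx), []).append(incident_id)
    return sorted(groups.values())


# Hypothetical vectors and incidents, invented for this example.
embeddings = {
    "disk_full": [1.0, 0.1],
    "io_latency": [0.9, 0.2],
    "cert_expired": [0.0, 1.0],
}
incidents = [
    ("i1", "disk_full", 0),
    ("i2", "io_latency", 30),
    ("i3", "cert_expired", 45),
    ("i4", "disk_full", 5000),
]
groups = aggregate(incidents, embeddings)
print(groups)
```

In this sketch, `i1` and `i2` end up in one group because their type vectors are close and they occur within the window, while `i3` (dissimilar vector) and `i4` (far away in time) remain separate.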
Taxonomy
Topics: Software System Performance and Reliability · Complex Network Analysis Techniques · Cloud Computing and Resource Management
Methods: GRLIA
