AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems
Guangba Yu, Genting Mai, Rui Wang, Ruipeng Li, Pengfei Chen, Long Pan, Ruijie Xu

TL;DR
AlertGuardian is a framework that combines large language models and graph learning to reduce alert fatigue and improve fault diagnosis in large-scale cloud systems, demonstrating significant alert reduction and enhanced rule quality.
Contribution
The paper introduces AlertGuardian, a novel alert management framework integrating LLMs and graph models for effective alert denoising, summarization, and rule refinement in cloud systems.
Findings
94.8% alert reduction ratio
90.5% fault diagnosis accuracy
Improved 1,174 alert rules with 32% acceptance
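For readers who want to see what a figure like the alert reduction ratio measures, here is a minimal sketch. It assumes the ratio is the fraction of raw alerts that are no longer surfaced to operators after processing; the paper's exact definition is not given on this page, and the function name and the counts in the usage line are hypothetical.

```python
def alert_reduction_ratio(raw_alerts: int, remaining_alerts: int) -> float:
    """Fraction of raw alerts removed by processing.

    Assumed definition (not confirmed by the source):
        1 - remaining_alerts / raw_alerts
    """
    if raw_alerts <= 0:
        raise ValueError("raw_alerts must be positive")
    if not 0 <= remaining_alerts <= raw_alerts:
        raise ValueError("remaining_alerts must be between 0 and raw_alerts")
    return 1.0 - remaining_alerts / raw_alerts


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    ratio = alert_reduction_ratio(raw_alerts=10_000, remaining_alerts=520)
    print(f"{ratio:.1%}")
```

Under this assumed definition, a 94.8% ratio means that roughly 52 of every 1,000 raw alerts still reach an operator.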
Abstract
Alerts are critical for detecting anomalies in large-scale cloud systems, ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, and ineffective alert life-cycle management degrades operational efficiency. This paper details the efforts of Company-X to optimize alert life-cycle management and address alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) with lightweight graph models to optimize the alert life-cycle in three phases: Alert Denoise uses a graph learning model with virtual noise to filter out noisy alerts, Alert Summary employs Retrieval-Augmented Generation (RAG) with LLMs to create actionable summaries, and Alert Rule Refinement leverages iterative multi-agent feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X's services,…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Advanced Software Engineering Methodologies
