BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems
Tao Duan, Runqing Chen, Pinghui Wang, Junzhou Zhao, Jiongzhou Liu, Shujie Han, Yi Liu, Fan Xu

TL;DR
This paper introduces BSODiag, an unsupervised framework for diagnosing batch server outages in large-scale cloud systems, effectively analyzing failure causes using multi-source data and failure correlations.
Contribution
The paper presents a novel, lightweight, unsupervised diagnosis framework that models spatio-temporal failure correlations for batch server outages in cloud infrastructure.
Findings
Achieves 87.5% PR@3 in outage diagnosis
Attains 46.3% PCR, outperforming baselines
Effectively models failure correlations in complex systems
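
The summary does not define PR@3. As a rough illustration only, the sketch below assumes the top-k precision commonly used in root-cause-ranking evaluations: for each outage case, the number of true root causes among the top k ranked candidates, divided by min(k, number of true root causes), averaged over all cases. The function name `pr_at_k`, the case IDs, and the candidate labels are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set


def pr_at_k(rankings: Dict[str, List[str]],
            truths: Dict[str, Set[str]],
            k: int) -> float:
    """Assumed PR@k: mean over cases of (hits in top-k) / min(k, #true causes).

    This definition is an assumption; the paper may define PR@k differently.
    Each ranking is expected to list unique candidates, best first.
    """
    scores = []
    for case_id, ranked in rankings.items():
        true_causes = truths.get(case_id, set())
        if not true_causes:
            continue  # skip cases without a labeled root cause
        hits = sum(1 for cand in ranked[:k] if cand in true_causes)
        scores.append(hits / min(k, len(true_causes)))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical outage cases with ranked root-cause candidates.
rankings = {
    "case-1": ["power", "network", "disk"],
    "case-2": ["disk", "cooling", "network", "power"],
}
truths = {
    "case-1": {"network"},
    "case-2": {"power"},
}

print(pr_at_k(rankings, truths, k=3))  # case-1 hits, case-2 misses
```

Under this assumed definition, a PR@3 of 87.5% would mean that, averaged over the evaluated outages, 87.5% of the achievable root-cause hits appear among the top three candidates.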
Abstract
Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. In particular, a batch server outage, the most severe type of failure, can make all upstream services completely unavailable. In this work, we focus on the batch server outage diagnosis problem, aiming to analyze the root cause of outages accurately and promptly to facilitate troubleshooting. However, our empirical study conducted in a real industrial system indicates that this is a challenging task. First, the single-modal, coarse-grained failure monitoring data collected in the cloud infrastructure system (i.e., alerts, incidents, or changes) is insufficient for comprehensive failure profiling. Second, due to the intricate dependencies among devices, outages are…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Network Security and Intrusion Detection
