Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Yaohui Wang, Guozheng Li, Zijian Wang, Yu Kang, Yangfan Zhou, Hongyu Zhang, Feng Gao, Jeffrey Sun, Li Yang, Pochian Lee, Zhangwei Xu, Pu Zhao, Bo Qiao, Liqun Li, Xu Zhang, Qingwei Lin

TL;DR
This paper introduces COT, a novel outage triage method that leverages service correlation mining from historical data to accurately and efficiently identify root causes of outages in large-scale cloud systems.
Contribution
COT is the first approach to incorporate global service correlation analysis for outage root cause identification, improving accuracy and reducing manual effort.
Findings
COT achieves 82.1%~83.5% triage accuracy.
Outperforms existing methods by 28.0%~29.7%.
Validated on one year of Microsoft Azure data.
Abstract
Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step in mitigating the outage's impact. In current industrial practice, this is generally performed in a bootstrap manner and depends largely on human effort: the service that directly causes the outage is identified first, and the suspected root cause is then traced back manually from service to service until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services, so such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT,…
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research · Network Security and Intrusion Detection
