KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems
Wenwei Gu, Renyi Zhong, Guangba Yu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang, Michael R. Lyu

TL;DR
KPIRoot+ is an efficient framework that combines similarity and causality analysis for anomaly detection and root cause analysis in large-scale cloud systems, addressing limitations of previous methods.
Contribution
It introduces KPIRoot+ which enhances anomaly detection accuracy and efficiency by improving KPI representation and analysis techniques in cloud environments.
Findings
Outperforms eight state-of-the-art baselines by 2.9% to 35.7%.
Reduces analysis time cost by 34.7%.
Successfully deployed in a large-scale cloud provider's environment.
Abstract
To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-based approaches model these dependencies better, they often face challenges such as high computational demands and lack of interpretability. To address these issues, KPIRoot is proposed as an efficient method combining similarity and causality analysis. It uses symbolic aggregate approximation for compact KPI representation, improving analysis efficiency. However, deployment in Cloud H revealed two drawbacks: 1) threshold-based anomaly detection misses some performance anomalies, and 2) SAX…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware System Performance and Reliability · Cloud Computing and Resource Management · Anomaly Detection Techniques and Applications
