Identifying Performance Issues in Cloud Service Systems Based on Relational-Temporal Features
Wenwei Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Yuxin Su, Jiazhen Gu, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael Lyu

TL;DR
This paper introduces ISOLATE, a novel graph neural network-based method that combines relational and temporal features to accurately detect and localize performance issues in cloud systems, even amidst noisy data.
Contribution
ISOLATE is the first approach to integrate relational-temporal features with attention and positive unlabeled learning for performance issue detection in cloud systems.
Findings
Outperforms baseline models with an F1-score of 0.945.
Achieves a Hit rate@3 of 0.920 in localization.
Effectively handles noisy metrics in large-scale cloud data.
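For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1-score and a Hit rate@k are commonly computed. It assumes the usual definitions (F1 as the harmonic mean of precision and recall; Hit rate@k as the fraction of incidents whose true culprit appears among the top k ranked candidates). The paper's exact evaluation protocol may differ, and the function names and sample data are hypothetical.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hit_rate_at_k(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of incidents whose true culprit is in the top-k candidates.

    ranked[i] is the candidate list for incident i, most suspicious first;
    truth[i] is the (assumed single) true culprit for that incident.
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(1 for cands, t in zip(ranked, truth) if t in cands[:k])
    return hits / len(truth)


# Hypothetical data: two incidents, each with a ranked list of components.
example_ranked = [["db", "cache", "api"], ["lb", "api", "db", "queue"]]
example_truth = ["api", "queue"]
example_hit3 = hit_rate_at_k(example_ranked, example_truth, k=3)
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

In this toy data the first incident's culprit is ranked third (a hit at k=3) and the second incident's culprit is ranked fourth (a miss), so only half of the incidents count as hits.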
Abstract
Cloud systems are susceptible to performance issues, which may cause service-level agreement violations and financial losses. In current practice, crucial metrics are monitored periodically to provide insight into the operational status of components. Identifying performance issues is often formulated as an anomaly detection problem, which is tackled by analyzing each metric independently. However, this approach overlooks the complex dependencies existing among cloud components. Some graph neural network-based methods take both temporal and relational information into account; however, they struggle to identify the correlation violations in the metrics that serve as indicators of underlying performance issues. Furthermore, a large volume of components in a cloud system results in a vast array of noisy metrics. This complexity renders it impractical for engineers to fully…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection · Software System Performance and Reliability
