Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems
Jinyang Liu, Zhihan Jiang, Jiazhen Gu, Junjie Huang, Zhuangbin Chen, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

TL;DR
Prism is a scalable, non-intrusive clustering method that reveals functional groups of cloud instances by analyzing communication and resource usage patterns, improving observability and reliability in large-scale cloud systems.
Contribution
We introduce Prism, a novel coarse-to-fine clustering approach that effectively identifies functional instance groups in massive cloud environments, surpassing existing methods.
Findings
Prism achieves a V-measure of about 0.95 on real-world data from Huawei Cloud.
It outperforms state-of-the-art clustering solutions.
Prism enhances cloud system observability and reliability.
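For readers unfamiliar with the metric, V-measure is the weighted harmonic mean of homogeneity (each cluster contains instances of a single true class) and completeness (each true class lands in a single cluster), both defined through conditional entropy. The following is a minimal pure-Python sketch of the standard definition, not the paper's evaluation code; the function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(truth, predicted, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    h_truth = _entropy(truth)
    h_pred = _entropy(predicted)
    # By convention, a zero-entropy labeling scores 1.0 on the matching term.
    homogeneity = 1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, predicted) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


# Hypothetical ground-truth functions for four instances.
truth = ["web", "web", "db", "db"]
perfect = v_measure(truth, [0, 0, 1, 1])   # clusters match the true functions
merged = v_measure(truth, [0, 0, 0, 0])    # everything lumped into one cluster
partial = v_measure(truth, [0, 0, 0, 1])   # one "db" instance misplaced
```

A score near 1 means the predicted clusters are both pure and unfragmented with respect to the true functional groups; lumping every instance into one cluster is complete but not homogeneous, so it scores 0.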
Abstract
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose to infer functional clusters of instances, i.e., groups of instances having similar functionalities. We first conduct a pilot study on a large-scale cloud system, i.e., Huawei Cloud, demonstrating that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It…
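The abstract names two signals (communication patterns and resource usage patterns) and a coarse-to-fine strategy, but the excerpt is cut off before it describes the steps. The sketch below only illustrates the general idea of a two-stage grouping and should not be read as Prism's actual algorithm: the input shapes, the exact-peer-set coarse step, the greedy correlation-based fine step, and the `threshold` parameter are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if std_x == 0 or std_y == 0 else cov / (std_x * std_y)


def coarse_to_fine(peers, usage, threshold=0.9):
    """Two-stage grouping (illustrative assumption, not the paper's method).

    peers: instance id -> set of communication peers (hypothetical input)
    usage: instance id -> resource usage time series (hypothetical input)
    """
    # Coarse step: instances that talk to the same set of peers share a chunk.
    chunks = defaultdict(list)
    for instance, peer_set in peers.items():
        chunks[frozenset(peer_set)].append(instance)

    # Fine step: inside each chunk, greedily join an instance to the first
    # group whose representative has a highly correlated usage series.
    clusters = []
    for members in chunks.values():
        groups = []
        for instance in sorted(members):
            for group in groups:
                if pearson(usage[instance], usage[group[0]]) >= threshold:
                    group.append(instance)
                    break
            else:
                groups.append([instance])
        clusters.extend(groups)
    return clusters


# Toy data: three instances share a peer; one of them has an opposite usage trend.
peers = {"vm1": {"db"}, "vm2": {"db"}, "vm3": {"db"}, "vm4": {"cache"}}
usage = {
    "vm1": [1, 2, 3, 4],
    "vm2": [2, 4, 6, 8],
    "vm3": [4, 3, 2, 1],
    "vm4": [1, 2, 3, 4],
}
clusters = coarse_to_fine(peers, usage)
```

The coarse step keeps the expensive pairwise comparison confined to small chunks, which is the usual motivation for a coarse-to-fine design when the number of instances is large.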
Taxonomy
Topics: Cloud Computing and Resource Management · Data Stream Mining Techniques · Software System Performance and Reliability
