Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster
Rui Ren, Jinheng Li, Lei Wang, Jianfeng Zhan, Zheng Cao

TL;DR
This paper analyzes Alibaba's co-located datacenter workloads to understand anomalies, revealing resource imbalance, workload distribution patterns, and causes of anomalies like scheduling issues and system failures.
Contribution
It provides a detailed anomaly analysis of Alibaba's co-located workloads using a new dataset, highlighting resource imbalance and workload distribution patterns.
Findings
Resource utilization is imbalanced across machines.
Machines can be classified into 8 workload distribution categories.
Anomalies are mainly caused by scheduling issues and workload imbalance.
Abstract
In warehouse-scale cloud datacenters, co-locating online services and offline batch jobs is an efficient approach to improving datacenter utilization. To facilitate the understanding of interactions among co-located workloads and their real-world operational demands, Alibaba recently released a cluster usage and co-located workload dataset, which is the first publicly available dataset with precise information about the category of each job. In this paper, we perform a deep analysis of the released Alibaba workload dataset from the perspective of anomaly analysis and diagnosis. Through data preprocessing, node similarity analysis based on Dynamic Time Warping (DTW), co-located workload characteristic analysis, and anomaly analysis based on iForest, we reveal several insights, including: (1) The performance discrepancy of machines in Alibaba's production cluster is relatively large,…
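The abstract names Dynamic Time Warping (DTW) as the basis for comparing nodes. As a rough illustration of what such a comparison involves, the sketch below computes the textbook DTW distance between two utilization series. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference cost, and the sample series are assumptions made for this example only.

```python
def dtw_distance(a, b):
    """Textbook dynamic-programming DTW between two numeric sequences.

    Assumption: the local cost is the absolute difference between samples;
    the paper may use a different cost or a windowed variant.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical CPU-utilization samples for two machines (made-up values).
machine_a = [0, 1, 2, 3]
machine_b = [0, 0, 1, 2, 3]  # same shape as machine_a, delayed by one step
shifted_distance = dtw_distance(machine_a, machine_b)
```

Because DTW lets one series stretch in time to match the other, two machines with the same load pattern offset by one sample get a distance of zero here, whereas a pointwise comparison would report a difference. That tolerance to time shifts is the usual reason for choosing DTW to group machines by behavior.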
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability
