Workload Failure Prediction for Data Centers
Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

TL;DR
This paper presents machine learning models trained on workload traces to predict data center workload failures, enabling proactive management and resource optimization with high accuracy.
Contribution
It introduces queue-time and runtime failure prediction models trained on large datasets, improving failure detection accuracy and resource efficiency in data centers.
Findings
Queue-time model predicts failures with 90.61% precision.
Runtime model achieves 97.75% prediction precision.
Integration reduces CPU and memory usage by up to 16.7% and 14.53%, respectively.
Abstract
Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has increased significantly over the years, detection and prediction of workload failures have lagged far behind and will become increasingly critical as the system scale and complexity further increase. In this study, we analyze workload traces collected from a production cluster and train machine learning models on a large amount of data sets to predict workload failures. Our prediction models consist of a queue-time model that estimates the probability of workload failures before execution and a runtime model that predicts failures at runtime. Evaluation results show that the queue-time model and runtime model can predict workload failures with a maximum…
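The precision figures reported in the findings measure, among workloads a model flags as likely to fail, the share that actually failed. The sketch below illustrates only that metric. It is not the authors' evaluation code, and the `precision` helper and the label lists are hypothetical, made up for illustration.

```python
def precision(actual, predicted):
    """Return the fraction of predicted failures that were real failures."""
    # True positives: flagged as failing and actually failed.
    true_pos = sum(1 for a, p in zip(actual, predicted) if p and a)
    # False positives: flagged as failing but actually completed.
    false_pos = sum(1 for a, p in zip(actual, predicted) if p and not a)
    flagged = true_pos + false_pos
    # With no flagged workloads, precision is undefined; return 0.0 by convention here.
    return true_pos / flagged if flagged else 0.0


# Hypothetical labels (not from the paper): 1 = workload failed, 0 = workload completed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision(actual, predicted))  # 3 of the 4 flagged workloads failed -> 0.75
```

A high precision means few healthy workloads are flagged by mistake, which matters when a flagged workload may be acted on before it finishes.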
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability
