Workload Failure Prediction for Data Centers

Jie Li; Rui Wang; Ghazanfar Ali; Tommy Dang; Alan Sill; Yong Chen

arXiv:2301.05176 · cs.DC · January 13, 2023

TL;DR

This paper presents machine learning models, trained on workload traces from a production cluster, that predict data center workload failures with up to 90.61% precision before execution and up to 97.75% precision at runtime, enabling proactive management and resource optimization.

Contribution

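The paper introduces two failure prediction models trained on large sets of production workload traces: a queue-time model that estimates the probability of failure before execution, and a runtime model that predicts failures during execution. Together they improve failure detection accuracy and resource efficiency in data centers.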

Findings

01  The queue-time model predicts failures with up to 90.61% precision.

02  The runtime model predicts failures with up to 97.75% precision.

03  Integrating the prediction models reduces CPU and memory usage by up to 16.7% and 14.53%, respectively.
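
The precision figures above measure the fraction of workloads predicted to fail that actually failed. As a minimal sketch of that metric (the `precision` helper and the labels below are hypothetical illustrations, not the paper's code or data):

```python
def precision(y_true, y_pred):
    """Fraction of predicted failures (label 1) that were actual failures."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if true_pos + false_pos == 0:
        return 0.0  # no failures predicted; report 0.0 by convention
    return true_pos / (true_pos + false_pos)


# Hypothetical labels for eight workloads (1 = failed, 0 = completed).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

score = precision(y_true, y_pred)
print(f"precision = {score:.2f}")  # precision = 0.75
```

High precision matters here because acting on a false alarm, for example by terminating a workload that would have completed, wastes the resources the prediction is meant to save.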

Abstract

Failed workloads that consume significant computational resources in time and space substantially reduce the efficiency of data centers and thus limit the amount of scientific work that can be achieved. While computational power has increased significantly over the years, detection and prediction of workload failures have lagged far behind and will become increasingly critical as system scale and complexity further increase. In this study, we analyze workload traces collected from a production cluster and train machine learning models on large data sets to predict workload failures. Our prediction models consist of a queue-time model that estimates the probability of workload failures before execution and a runtime model that predicts failures at runtime. Evaluation results show that the queue-time model and runtime model can predict workload failures with a maximum…

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability