Towards Data-Driven Autonomics in Data Centers
Alina Sîrbu, Ozalp Babaoglu

TL;DR
This paper demonstrates how data-driven predictive models, built from large-scale data center logs, can forecast node failures with reasonable accuracy, paving the way for autonomous management of data centers.
Contribution
It introduces a methodology for building and evaluating a predictive failure model using ensemble classifiers on a large Google dataset, advancing autonomous data center management.
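As a rough illustration of what an ensemble classifier does, the sketch below averages the failure scores of several member models. The averaging rule, the feature layout, and the two toy members (`high_cpu`, `many_restarts`) are assumptions made for illustration; this summary does not state how the paper combines its members.

```python
from statistics import fmean
from typing import Callable, Sequence

Features = Sequence[float]
# A member model maps a feature vector to a failure score in [0, 1].
Model = Callable[[Features], float]


def ensemble_score(models: Sequence[Model], x: Features) -> float:
    """Combine member scores by simple averaging (an assumed rule)."""
    if not models:
        raise ValueError("ensemble needs at least one model")
    return fmean(m(x) for m in models)


# Hypothetical stand-in members; real members would be trained classifiers.
def high_cpu(x: Features) -> float:
    # x[0] is assumed to be CPU utilisation in [0, 1].
    return 1.0 if x[0] > 0.9 else 0.0


def many_restarts(x: Features) -> float:
    # x[1] is assumed to be a recent restart count.
    return min(x[1] / 10.0, 1.0)


members = [high_cpu, many_restarts]
score = ensemble_score(members, [0.95, 5.0])
```

Averaging keeps the output a score rather than a hard label, so a decision threshold can still be tuned afterwards to trade false positives against true positives.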
Findings
Achieved true positive rates between 27% and 88% at a 5% false positive rate.
Predicted failures within a 24-hour window using ensemble classifiers.
Provided publicly available scripts for data processing and modeling.
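The first two findings can be made concrete with a small sketch: labelling a node sample as positive when a failure follows within 24 hours, and reading off the best true positive rate that keeps the false positive rate at or below a cap. The function names, the use of timestamps in seconds, and the exact window boundaries are assumptions for illustration, not the authors' published scripts.

```python
from typing import Sequence

DAY_SECONDS = 24 * 60 * 60


def label_sample(sample_time: float,
                 failure_times: Sequence[float],
                 window: float = DAY_SECONDS) -> int:
    """Return 1 if any failure falls in (sample_time, sample_time + window]."""
    return int(any(sample_time < f <= sample_time + window
                   for f in failure_times))


def tpr_at_fpr(scores: Sequence[float],
               labels: Sequence[int],
               max_fpr: float = 0.05) -> float:
    """Best true positive rate over thresholds whose FPR is <= max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best = 0.0
    # Try each distinct score as a threshold; predict "fail" when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best
```

For example, `tpr_at_fpr(scores, labels, max_fpr=0.05)` on a held-out set gives a figure directly comparable to the rates quoted above, provided the labels were built with the same 24-hour window.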
Abstract
Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big…
