Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Alina Sîrbu, Ozalp Babaoglu

TL;DR
This paper demonstrates that data-driven predictive models can effectively forecast data center node failures, enabling proactive management and reducing operational costs in large-scale data centers.
Contribution
It introduces a practical approach that uses ensemble classifiers, trained on a public Google cluster dataset, to predict node failures, advancing autonomous data center management.
Findings
Predictive models achieve a true positive rate of up to 88% at a 5% false positive rate.
Precision of 50-72% in failure prediction supports effective job rerouting.
Data-driven models are practical for large-scale, real-world data center management.
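To make the relationship between these figures concrete, the following minimal sketch computes true positive rate, false positive rate, and precision from confusion-matrix counts. The counts are hypothetical, chosen only so the rates land near the reported values; they are not taken from the paper, and the `rates` helper is an illustrative name rather than anything the authors define.

```python
from typing import NamedTuple


class Rates(NamedTuple):
    tpr: float        # true positive rate (recall): failures caught / all failures
    fpr: float        # false positive rate: false alarms / all healthy cases
    precision: float  # predicted failures that were real / all predicted failures


def rates(tp: int, fn: int, fp: int, tn: int) -> Rates:
    """Compute the three metrics from confusion-matrix counts."""
    return Rates(
        tpr=tp / (tp + fn),
        fpr=fp / (fp + tn),
        precision=tp / (tp + fp),
    )


# Hypothetical counts for illustration only (not from the paper):
# 100 cases that end in a node failure, 1000 that do not.
example = rates(tp=88, fn=12, fp=50, tn=950)
print(f"TPR={example.tpr:.2f}  FPR={example.fpr:.2f}  precision={example.precision:.2f}")
```

Because healthy cases far outnumber failures, even a 5% false positive rate produces enough false alarms to keep precision well below the true positive rate, which is why both figures are reported.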
Abstract
Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the…
