KEA: Tuning an Exabyte-Scale Data Infrastructure
Yiwen Zhu, Subru Krishnan, Konstantinos Karanasos, Isha Tarte, Conor Power, Abhishek Modi, Manoj Kumar, Deli Zhang, Kartheek Muthyala, Nick Jurgens, Sarvesh Sakalanaga, Sudhir Darbha, Minu Iyer, Ankita Agarwal, Carlo Curino

TL;DR
This paper introduces KEA, an automated, data-driven system that optimizes the configuration of Microsoft's exabyte-scale data infrastructure, significantly improving efficiency and reducing costs through machine learning models and continuous tuning.
Contribution
KEA is the first system to automate tuning of exabyte-scale data infrastructure using machine learning, combining observational and conservative testing methods for broad application.
Findings
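Started from a manually tuned baseline that already averaged >60% CPU utilization across all clusters, the point at which manual tuning had plateaued.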
Supported diverse applications with automated tuning.
Projected tens of millions of dollars in annual savings.
Abstract
Microsoft's internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware/software/workloads this manual tuning approach had reached its limit -- we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our cluster dynamic behavior in a set of machine learning (ML) models…
