Intelligent Pooling: Proactive Resource Provisioning in Large-scale Cloud Service

Deepak Ravikumar; Alex Yeo; Yiwen Zhu; Aditya Lakra; Harsha Nagulapalli; Santhosh Kumar Ravindran; Steve Suh; Niharika Dutta; Andrew Fogarty; Yoonjae Park; Sumeet Khushalani; Arijit Tarafdar; Kunal Parekh; Subru Krishnan

arXiv:2411.11326·cs.DB·November 19, 2024

TL;DR

This paper presents Intelligent Pooling, a proactive resource provisioning system for cloud Spark clusters that predicts usage patterns with machine learning to reduce costs and improve latency, saving millions annually.

Contribution

It introduces a hybrid ML model for accurate, low-latency prediction of resource demand and a dynamic pool sizing mechanism that minimizes operational costs in cloud environments.
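To make the idea of sizing a pool from predicted demand concrete, here is a minimal sketch. It is not the paper's mechanism: it swaps in the classic newsvendor rule, which picks the pool size at the demand quantile given by the ratio of the miss cost to the total of miss and idle costs. The function names, the per-cluster `idle_cost` and `miss_cost` parameters, and the sample demand values are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def newsvendor_pool_size(
    demand_samples: Sequence[int], idle_cost: float, miss_cost: float
) -> int:
    """Return the pool size that minimizes expected idle-plus-miss cost.

    demand_samples: hypothetical forecast samples of cluster requests
                    in the next interval (an assumption, not paper data).
    idle_cost:      assumed cost of one pooled cluster that goes unused.
    miss_cost:      assumed cost of one request that finds the pool empty
                    and has to wait for a cold start.
    """
    if not demand_samples:
        raise ValueError("need at least one demand sample")
    if idle_cost <= 0 or miss_cost <= 0:
        raise ValueError("costs must be positive")
    # Newsvendor critical ratio: the service level worth paying for.
    critical_ratio = miss_cost / (miss_cost + idle_cost)
    ordered = sorted(demand_samples)
    # Smallest sample whose empirical CDF reaches the critical ratio.
    index = max(0, math.ceil(critical_ratio * len(ordered)) - 1)
    return ordered[index]


def expected_cost(
    pool: int, demand_samples: Sequence[int], idle_cost: float, miss_cost: float
) -> float:
    """Average cost of a given pool size over the demand samples."""
    total = 0.0
    for demand in demand_samples:
        total += idle_cost * max(pool - demand, 0)   # unused pooled clusters
        total += miss_cost * max(demand - pool, 0)   # requests hitting a cold start
    return total / len(demand_samples)


if __name__ == "__main__":
    samples = [2, 3, 3, 4, 5, 6, 8, 10, 12, 15]
    size = newsvendor_pool_size(samples, idle_cost=1.0, miss_cost=3.0)
    print(size, expected_cost(size, samples, 1.0, 3.0))
```

The design choice worth noting is the asymmetry: when a cold start is assumed to cost more than an idle cluster, the rule deliberately over-provisions relative to the median forecast, and it shrinks the pool as idle capacity becomes relatively more expensive.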

Findings

01 Achieves up to 43% reduction in cluster idle time.

02 Reduces costs by optimizing pool size based on predicted demand.

03 Deployed in production, saving tens of millions annually.

Abstract

The proliferation of big data and analytic workloads has driven the need for cloud compute and cluster-based job processing. With Apache Spark, users can process terabytes of data with ease using hundreds of parallel executors. At Microsoft, we aim to provide a fast and succinct interface for users to run Spark applications, such as through creating simple notebook "sessions", by abstracting away the underlying complexity of the cloud. Providing low-latency access to Spark clusters and sessions is a challenging problem due to the large overheads of cluster creation and session startup. In this paper, we introduce Intelligent Pooling, a system for proactively provisioning compute resources to combat these overheads. To reduce the COGS (cost-of-goods-sold), our system (1) predicts usage patterns using an innovative hybrid Machine Learning (ML) model with low latency and high…
