Phoebe: A Learning-based Checkpoint Optimizer
Yiwen Zhu, Matteo Interlandi, Abhishek Roy, Krishnadhan Das, Hiren Patel, Malay Bag, Hitesh Sharma, Alekh Jindal

TL;DR
Phoebe is a learning-based checkpoint optimizer that predicts job execution metrics and optimally determines checkpoints, significantly reducing storage hotspots and recovery times in large-scale data processing.
Contribution
Introduces Phoebe, a novel machine learning-driven checkpoint optimizer that improves efficiency and fault recovery in big data systems by predicting execution parameters and solving an optimization problem.
Findings
Reduced temporary storage hotspots by over 70%
Enabled 68% faster job restarts after failures
Demonstrated effectiveness in production workloads
Abstract
Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For…
