Phoebe: A Learning-based Checkpoint Optimizer

Yiwen Zhu; Matteo Interlandi; Abhishek Roy; Krishnadhan Das; Hiren Patel; Malay Bag; Hitesh Sharma; Alekh Jindal

arXiv:2110.02313 · cs.DB · August 11, 2022

TL;DR

Phoebe is a learning-based checkpoint optimizer that predicts job execution metrics and optimally determines checkpoints, significantly reducing storage hotspots and recovery times in large-scale data processing.

Contribution

Introduces Phoebe, a novel machine learning-driven checkpoint optimizer that improves efficiency and fault recovery in big data systems by predicting execution parameters and solving an optimization problem.

Findings

1. Reduced temporary storage hotspots by over 70%
2. Enabled 68% faster job restarts after failures
3. Demonstrated effectiveness in production workloads

Abstract

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For…
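
To make the idea of choosing checkpoints under a constraint concrete, here is a toy sketch in Python. It is not Phoebe's formulation, which this page does not spell out. Everything in it is an assumption made for illustration: the job is modeled as a linear chain of stages, the per-stage runtimes and output sizes stand in for what learned predictors might supply, the constraint is a single storage budget, the objective is the average re-execution time after a failure that is equally likely in any stage, and the search is plain brute force. The function names expected_rerun_time and choose_checkpoints are hypothetical.

```python
from itertools import combinations


def expected_rerun_time(runtimes, checkpoints):
    """Average time re-executed after a failure, assuming the failure is
    equally likely to hit any stage (a simplifying assumption).

    A checkpoint at index j means the output of stage j is saved, so a
    failure in a later stage restarts from stage j + 1.
    """
    total = 0.0
    for failed in range(len(runtimes)):
        # Restart just after the latest checkpoint before the failed stage.
        start = max((c + 1 for c in checkpoints if c < failed), default=0)
        total += sum(runtimes[start:failed + 1])
    return total / len(runtimes)


def choose_checkpoints(runtimes, output_sizes, budget):
    """Brute-force search over checkpoint sets whose total output size
    fits within the storage budget; returns (expected_cost, checkpoints)."""
    n = len(runtimes)
    best = (expected_rerun_time(runtimes, ()), ())
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            if sum(output_sizes[i] for i in subset) > budget:
                continue  # violates the storage constraint
            cost = expected_rerun_time(runtimes, subset)
            if cost < best[0]:
                best = (cost, subset)
    return best


# Hypothetical predicted values for a four-stage job.
runtimes = [10, 20, 30, 40]
output_sizes = [5, 50, 8, 1]
cost, chosen = choose_checkpoints(runtimes, output_sizes, budget=10)
print(cost, chosen)
```

Brute force is only viable for a handful of stages; a real job plan is a graph with many more candidate cut points, so a production system would need a more scalable optimization method than this enumeration.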
