Cost-aware Duration Prediction for Software Upgrades in Datacenters
Yi Ding, Aijia Gao, Thibaud Ryden, Michal Sedlak, Essam Ewaisha, Igor Marnat, Henry Hoffmann

TL;DR
This paper introduces Acela, a cost-aware duration prediction framework that enhances software upgrade scheduling efficiency in datacenters, leading to significant improvements in throughput and resource utilization.
Contribution
The paper presents the first comprehensive study and a novel prediction framework for software upgrade scheduling at datacenter scale.
Findings
Increases upgrade window utilization by 1.25X
Boosts scheduled and completed upgrades by 33% and 41%, respectively
Reduces cancellation rates by 2.4X
Abstract
Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela…
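The idea of asymmetric misprediction costs and cost-based model selection can be illustrated with a minimal sketch. This is not Acela's implementation: the function names, the linear cost form, and the weights are hypothetical, and the choice to penalize underprediction more heavily than overprediction is an assumption made only for illustration. The sketch scores each candidate predictor by its mean asymmetric cost on past upgrade durations and picks the cheapest one.

```python
from typing import Dict, Sequence


def asymmetric_cost(actual: float, predicted: float,
                    under_weight: float = 3.0,
                    over_weight: float = 1.0) -> float:
    """Linear cost with different weights for under- and overprediction.

    The weights are illustrative placeholders, not values from the paper.
    """
    error = actual - predicted
    if error > 0:
        # Predicted too short: the upgrade runs longer than planned.
        return under_weight * error
    # Predicted too long (or exact): part of the window goes unused.
    return over_weight * (-error)


def mean_cost(actuals: Sequence[float], predictions: Sequence[float],
              under_weight: float = 3.0, over_weight: float = 1.0) -> float:
    """Average asymmetric cost over a set of historical upgrades."""
    total = sum(asymmetric_cost(a, p, under_weight, over_weight)
                for a, p in zip(actuals, predictions))
    return total / len(actuals)


def select_model(actuals: Sequence[float],
                 candidates: Dict[str, Sequence[float]],
                 under_weight: float = 3.0,
                 over_weight: float = 1.0) -> str:
    """Return the name of the candidate with the lowest mean cost."""
    return min(candidates,
               key=lambda name: mean_cost(actuals, candidates[name],
                                          under_weight, over_weight))


# Hypothetical durations in minutes and two hypothetical predictors.
actual_minutes = [30.0, 45.0, 60.0]
candidates = {
    "optimistic": [25.0, 40.0, 55.0],  # underpredicts each upgrade by 5
    "padded": [38.0, 53.0, 68.0],      # overpredicts each upgrade by 8
}
best = select_model(actual_minutes, candidates)
```

With these weights the "padded" predictor wins even though its absolute error is larger, because each minute of underprediction is charged three times as much as a minute of overprediction. With equal weights the "optimistic" predictor would be selected instead, which is the sense in which the selection is cost-aware rather than purely accuracy-driven.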
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability
