TL;DR
This paper investigates the time-dependent preemption patterns of transient cloud VMs, develops a new probabilistic model, and creates resource management policies that significantly improve reliability and reduce costs in scientific computing workloads.
Contribution
It introduces a novel bathtub-shaped preemption probability model for temporally constrained preemptions and demonstrates its effectiveness in optimizing job scheduling and checkpointing policies.
Findings
Preemptions are time-dependent with a bathtub shape.
Existing memoryless models are inadequate for these preemptions.
Model-based policies can halve job failure probability and reduce costs by 5x.
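To make the first two findings concrete, here is a minimal Python sketch contrasting a memoryless preemption model with a bathtub-shaped one. It is an illustration under stated assumptions, not the paper's fitted model: the hazard function form, every parameter value, and the names `memoryless_hazard`, `bathtub_hazard`, `cumulative_hazard`, and `job_failure_probability` are hypothetical. Only the 24-hour maximum lifetime comes from the abstract.

```python
import math

LIFETIME_H = 24.0  # maximum VM lifetime in hours, as stated in the abstract


def memoryless_hazard(t, rate=0.1):
    """Constant preemption rate per hour (exponential lifetimes); rate is illustrative."""
    return rate


def bathtub_hazard(t, early=0.5, tau_early=1.0, floor=0.01, late=2.0, tau_late=0.5):
    """Hypothetical bathtub hazard: high at start, low in the middle, rising near the deadline."""
    return (early * math.exp(-t / tau_early)
            + floor
            + late * math.exp((t - LIFETIME_H) / tau_late))


def cumulative_hazard(hazard, start, end, steps=2000):
    """Integrate the hazard over [start, end] with the trapezoidal rule."""
    width = (end - start) / steps
    total = 0.5 * (hazard(start) + hazard(end))
    for i in range(1, steps):
        total += hazard(start + i * width)
    return total * width


def job_failure_probability(hazard, vm_age, job_len):
    """Probability that a VM already running for vm_age hours is preempted
    before a job of job_len hours finishes."""
    if vm_age + job_len > LIFETIME_H:
        return 1.0  # the job cannot finish before the lifetime limit
    return 1.0 - math.exp(-cumulative_hazard(hazard, vm_age, vm_age + job_len))


if __name__ == "__main__":
    for age in (0.0, 10.0, 20.0):
        flat = job_failure_probability(memoryless_hazard, age, 4.0)
        tub = job_failure_probability(bathtub_hazard, age, 4.0)
        print(f"VM age {age:4.1f} h: memoryless={flat:.3f}  bathtub={tub:.3f}")
```

Under the memoryless model the failure probability of a 4-hour job is the same at every VM age, so it gives a scheduler no reason to prefer one VM over another. Under the bathtub-shaped hazard the same job is safest on a VM in the middle of its lifetime, which is the kind of signal a model-based scheduling or checkpointing policy can exploit.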
Abstract
Transient cloud servers, such as Amazon Spot instances, Google Preemptible VMs, and Azure Low-priority batch VMs, can reduce cloud computing costs by as much as , but can be unilaterally preempted by the cloud provider. Understanding preemption characteristics (such as frequency) is a key first step in minimizing the effect of preemptions on application performance, availability, and cost. However, little is understood about temporally constrained preemptions, in which preemptions must occur within a given time window. We study temporally constrained preemptions by conducting a large-scale empirical study of Google's Preemptible VMs (which have a maximum lifetime of 24 hours), developing a new preemption probability model and new model-driven resource management policies, and implementing them in a batch computing service for scientific computing workloads. Our statistical and experimental…
