Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances
Ashley Tung, Haiyan Wang, Yue Li, Zhong Wang, and Jingchao Sun

TL;DR
Spot-on is a flexible checkpointing framework that enables fault-tolerant execution of long-running workloads on cloud spot instances, significantly reducing costs and runtime despite instance evictions.
Contribution
The paper introduces Spot-on, a universal checkpointing framework compatible with major cloud providers, supporting both application-specific and transparent checkpointing for fault tolerance.
Findings
Supports fault-tolerant long-running workloads on spot instances
Reduces runtime by up to 40% with transparent checkpointing, compared to application-specific checkpointing
Achieves cost savings of up to 86% compared to on-demand instances
Abstract
Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrated that Spot-on supports both application-specific and transparent checkpointing methods. Compared to running applications on on-demand instances, Spot-on allows these workloads to complete at a significantly lower computing cost. Compared to running applications using application-specific checkpoint mechanisms, transparent…
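To make the checkpoint-and-restart idea concrete, the following is a minimal sketch of application-specific checkpointing, not Spot-on's actual implementation or API. The names `save_checkpoint`, `load_checkpoint`, `run_job`, and the `Evicted` exception are hypothetical, and the eviction is simulated inside the process rather than delivered by a cloud provider.

```python
import json
import os
import tempfile


class Evicted(Exception):
    """Hypothetical stand-in for a spot-instance eviction."""


def save_checkpoint(path, state):
    """Write state atomically so an eviction mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint; atomic on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def run_job(path, items, checkpoint_every=10, evict_at=None):
    """Sum `items`, saving progress every `checkpoint_every` items.

    `evict_at` simulates an eviction just before that item index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if evict_at is not None and i == evict_at:
            raise Evicted(f"evicted before item {i}")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the job is interrupted, calling `run_job` again with the same checkpoint path resumes from the last saved item instead of starting over; only the work done since the last checkpoint is repeated. Transparent checkpointing differs in that the process state is captured from outside the application, so the application needs no such save and load logic.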
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
