Checkpointing as a Service in Heterogeneous Cloud Environments
Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin

TL;DR
This paper presents a cloud-agnostic checkpointing service that enhances fault tolerance, supports long-running and distributed HPC applications, and enables cross-cloud migration, improving cloud resource management and application resilience.
Contribution
It introduces a non-invasive, uniform checkpoint-restart mechanism applicable across heterogeneous cloud platforms, facilitating fault tolerance and application migration.
Findings
Supports parallel and distributed computations over TCP and InfiniBand.
Enables migration of applications between different cloud platforms.
Detects failed or resource-starved long-running jobs through integrated health monitoring and handles them proactively.
Abstract
A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur…
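Purpose (b) in the abstract, swapping out jobs when higher priority jobs arrive on an over-subscribed cloud, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Job` and `Cloud` classes, the slot-based capacity model, and the "larger number means higher priority" convention are all assumptions made for illustration. The checkpoint and restart steps are represented only by moving a job between two lists.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    priority: int  # assumed convention: larger value = more important
    slots: int     # hypothetical unit of resource the job occupies


@dataclass
class Cloud:
    capacity: int
    running: list = field(default_factory=list)
    suspended: list = field(default_factory=list)

    def used(self) -> int:
        return sum(job.slots for job in self.running)

    def submit(self, job: Job) -> bool:
        """Admit a job, swapping out lower-priority jobs if capacity is short."""
        # Candidate victims: strictly lower priority, lowest priority first.
        victims = sorted(
            (j for j in self.running if j.priority < job.priority),
            key=lambda j: j.priority,
        )
        free = self.capacity - self.used()
        chosen = []
        for victim in victims:
            if free >= job.slots:
                break
            chosen.append(victim)
            free += victim.slots
        if free < job.slots:
            return False  # cannot fit even after swapping; leave state untouched
        for victim in chosen:
            # Stand-in for a real checkpoint: the job is only moved to a list.
            self.running.remove(victim)
            self.suspended.append(victim)
        self.running.append(job)
        return True

    def finish(self, job: Job) -> None:
        """Remove a completed job, then resume suspended jobs that now fit."""
        self.running.remove(job)
        for waiting in sorted(self.suspended, key=lambda j: -j.priority):
            if self.used() + waiting.slots <= self.capacity:
                # Stand-in for a real restart from the saved checkpoint.
                self.suspended.remove(waiting)
                self.running.append(waiting)
```

In this sketch, submitting a 2-slot high-priority job to a 4-slot cloud already running a 3-slot low-priority job suspends the low-priority job; calling `finish` on the high-priority job then resumes it.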
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
