Checkpointing as a Service in Heterogeneous Cloud Environments

Jiajun Cao; Matthieu Simonin; Gene Cooperman; Christine Morin

arXiv:1411.1958 · cs.DC · March 24, 2015

TL;DR

This paper presents a cloud-agnostic checkpointing service that enhances fault tolerance, supports long-running and distributed HPC applications, and enables cross-cloud migration, improving cloud resource management and application resilience.

Contribution

It introduces a non-invasive, uniform checkpoint-restart mechanism applicable across heterogeneous cloud platforms, facilitating fault tolerance and application migration.

Findings

01. Supports parallel and distributed computations over TCP and InfiniBand.

02. Enables migration of applications between different cloud platforms.

03. Proactively manages job failures and resource starvation.

Abstract

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur…
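
To make purpose (b) concrete, the following is a minimal sketch of how an over-subscribed cloud might swap out a lower-priority job by checkpointing it and restart it once capacity frees up. It is an illustration only, not the paper's implementation: the names `Job`, `ToyCloud`, `submit`, and `finish` are hypothetical, a single integer `progress` stands in for real application state, and larger `priority` values are assumed to mean more important jobs.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    priority: int      # assumption: larger value = more important
    progress: int = 0  # stand-in for the application state a checkpoint would save


@dataclass
class ToyCloud:
    """Hypothetical scheduler with a fixed number of execution slots."""
    slots: int
    running: list = field(default_factory=list)      # names of running jobs
    checkpoints: dict = field(default_factory=dict)  # name -> saved progress
    waiting: list = field(default_factory=list)      # heap of (-priority, seq, name)
    jobs: dict = field(default_factory=dict)         # name -> Job
    _seq: int = 0                                    # tie-breaker for equal priorities

    def submit(self, job):
        self.jobs[job.name] = job
        if len(self.running) < self.slots:
            self.running.append(job.name)
            return
        # Cloud is full: find the least important running job.
        victim = min(self.running, key=lambda n: self.jobs[n].priority)
        if self.jobs[victim].priority < job.priority:
            self._swap_out(victim)
            self.running.append(job.name)
        else:
            self._enqueue(job.name)

    def _enqueue(self, name):
        heapq.heappush(self.waiting, (-self.jobs[name].priority, self._seq, name))
        self._seq += 1

    def _swap_out(self, name):
        # "Checkpoint": save the state, free the slot, queue the job for restart.
        self.checkpoints[name] = self.jobs[name].progress
        self.running.remove(name)
        self._enqueue(name)

    def finish(self, name):
        self.running.remove(name)
        if self.waiting:
            _, _, nxt = heapq.heappop(self.waiting)
            if nxt in self.checkpoints:
                # "Restart": resume from the saved state instead of from scratch.
                self.jobs[nxt].progress = self.checkpoints.pop(nxt)
            self.running.append(nxt)
```

For example, with `ToyCloud(slots=1)`, submitting a long-running job at priority 1 and then an urgent job at priority 5 checkpoints the first job and runs the second; calling `finish` on the urgent job restarts the first one from its saved progress. A real service would checkpoint process and network state rather than a single counter.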

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems