Checkpointing and Localized Recovery for Nested Fork-Join Programs

Claudia Fohry

arXiv:2102.12941 · cs.DC · March 1, 2021 · 1 citation

TL;DR

This paper extends a low-overhead checkpointing and localized recovery technique from independent tasks to nested fork-join programs in distributed memory systems, enabling efficient fault recovery.

Contribution

It adapts an existing low-overhead checkpointing method to nested fork-join programs with work stealing, and expects the low overheads of the original method to carry over.

Findings

1. Checkpointing overheads are expected to remain below 1%, as in the original technique.

2. Localized recovery allows unaffected processes to continue.

3. Algorithmic changes make the technique applicable to nested fork-join programs.

Abstract

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks can be reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work-stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs; we expect the new algorithm to achieve similar performance.
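The idea of localized recovery for independent tasks, as described in the abstract, can be illustrated with a toy simulation. The sketch below is not the paper's algorithm and does not model nested fork-join or work stealing. It is a single-process Python illustration under simplifying assumptions: tasks are plain integers that are "executed" by adding them to a per-worker partial sum, checkpoints are written at fixed step intervals, and at most one worker fails. All names (`run_with_localized_recovery`, `fail_worker`, `checkpoint_interval`, and so on) are hypothetical.

```python
from collections import deque


def run_with_localized_recovery(num_workers, tasks, fail_worker=None,
                                fail_after_steps=None, checkpoint_interval=2):
    """Toy simulation: sum integer tasks across workers, surviving one failure.

    Hypothetical illustration only; not the algorithm from the paper.
    """
    assert num_workers >= 2, "need at least one survivor to take over"

    # Initial round-robin distribution of independent tasks.
    queues = {w: deque() for w in range(num_workers)}
    for i, task in enumerate(tasks):
        queues[i % num_workers].append(task)

    results = {w: 0 for w in range(num_workers)}
    alive = set(range(num_workers))

    def write_checkpoint(w):
        # Assumed resilient store: open tasks plus the partial result so far.
        checkpoint[w] = (list(queues[w]), results[w])

    checkpoint = {}
    for w in alive:
        write_checkpoint(w)

    step = 0
    while any(queues[w] for w in alive):
        # Every live worker processes one task per step.
        for w in sorted(alive):
            if queues[w]:
                results[w] += queues[w].popleft()
        step += 1

        if step % checkpoint_interval == 0:
            for w in alive:
                write_checkpoint(w)

        if step == fail_after_steps and fail_worker in alive:
            # The failed worker's in-memory state is lost.
            alive.remove(fail_worker)
            del queues[fail_worker]
            del results[fail_worker]

            # Localized recovery: survivors keep their own state and only
            # take over the failed worker's last checkpoint.
            saved_tasks, saved_result = checkpoint.pop(fail_worker)
            survivors = sorted(alive)
            results[survivors[0]] += saved_result
            for i, task in enumerate(saved_tasks):
                queues[survivors[i % len(survivors)]].append(task)
            for w in survivors:
                write_checkpoint(w)

    return sum(results[w] for w in alive)


if __name__ == "__main__":
    tasks = list(range(1, 21))
    print(run_with_localized_recovery(4, tasks, fail_worker=2,
                                      fail_after_steps=3))
```

In this sketch, tasks that the failed worker ran after its last checkpoint are executed again by a survivor, while the intact workers never roll back. The final sum therefore matches a failure-free run.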

Taxonomy

Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management