Checkpointing and Localized Recovery for Nested Fork-Join Programs
Claudia Fohry

TL;DR
This paper extends a low-overhead checkpointing and localized recovery technique from independent tasks to nested fork-join programs in distributed memory systems, enabling efficient fault recovery.
Contribution
It adapts an existing low-overhead checkpointing method to nested fork-join programs with work stealing; the overheads are expected to stay similarly low.
Findings
Checkpointing overheads are expected to stay close to those of the original technique (below 1%)
Localized recovery allows unaffected processes to continue
Algorithmic changes enable application to nested fork-join programs
Abstract
While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks can be reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique that was originally developed for independent tasks to nested fork-join programs. We consider a Cilk-like work stealing scheme with work-first policy in a distributed memory setting, and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible costs for recovery; we expect the new algorithm to achieve similar performance.
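To make the idea of localized recovery concrete, the following is a minimal single-process sketch for the simpler independent-tasks setting that the abstract names as the starting point. It is not the paper's algorithm and does not model work stealing or nested fork-join. All names (`Worker`, `run`, `fail_at`, `checkpoint_every`) are hypothetical, tasks are just integers to be summed, and the checkpoint store is a plain dictionary standing in for whatever resilient storage a real system would use.

```python
from collections import deque


class Worker:
    """Hypothetical worker holding independent tasks (integers to sum)."""

    def __init__(self, wid, tasks):
        self.wid = wid
        self.queue = deque(tasks)  # pending tasks
        self.partial = 0           # result accumulated so far
        self.alive = True

    def step(self):
        # Process at most one task per step.
        if self.queue:
            self.partial += self.queue.popleft()

    def snapshot(self):
        # A consistent checkpoint: remaining tasks plus the partial result.
        return {"queue": list(self.queue), "partial": self.partial}


def next_alive(workers, wid):
    # Pick the next intact worker in ring order (assumed buddy scheme).
    ring = workers[wid + 1:] + workers[:wid]
    return next(w for w in ring if w.alive)


def run(task_lists, fail_at=None, checkpoint_every=2):
    """Sum all tasks; fail_at=(step, worker_id) injects one failure."""
    workers = [Worker(i, t) for i, t in enumerate(task_lists)]
    checkpoints = {w.wid: w.snapshot() for w in workers}
    step = 0
    while any(w.alive and w.queue for w in workers):
        step += 1
        if fail_at is not None and fail_at[0] == step:
            failed = workers[fail_at[1]]
            if failed.alive:
                failed.alive = False
                # Localized recovery: only the buddy adopts the lost
                # worker's last checkpoint; nobody else rolls back.
                buddy = next_alive(workers, failed.wid)
                saved = checkpoints.pop(failed.wid)
                buddy.queue.extend(saved["queue"])
                buddy.partial += saved["partial"]
                checkpoints[buddy.wid] = buddy.snapshot()
        for w in workers:
            if w.alive:
                w.step()
        if step % checkpoint_every == 0:
            for w in workers:
                if w.alive:
                    checkpoints[w.wid] = w.snapshot()
    return sum(w.partial for w in workers if w.alive)
```

In this sketch, work done by the failed worker since its last checkpoint is simply re-executed by the buddy, while the other workers keep their progress. Calling `run([[1, 2, 3, 4], [5, 6, 7], [8, 9]])` with and without `fail_at=(4, 0)` should yield the same total.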
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
