Checkpoint-Restart Libraries Must Become More Fault Tolerant
Anthony Skjellum, Derek Schafer

TL;DR
This paper argues that checkpoint-restart libraries used in MPI-based production environments must themselves be fault tolerant. It highlights deficiencies in current libraries and calls for enhanced fault detection, isolation, and recovery mechanisms.
Contribution
It identifies the lack of fault tolerance in existing MPI checkpoint libraries and advocates for the development of fault-tolerant extensions that support minimal detection, isolation, and recovery.
Findings
Current MPI checkpoint libraries are not fault tolerant.
Fault-tolerant extensions are needed for MPI checkpoint libraries.
Communication between MPI and checkpoint libraries can improve fault detection.
Abstract
Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with byzantine outcomes. But certain popular libraries that leverage MPI are evidently not fault tolerant. Nowadays, fault detection with automatic recovery without batch requeueing is a strong requirement for production environments. Thus, allowing deadlock and setting long timeouts are suboptimal for fault detection even when paired with conservative recovery from the penultimate checkpoint. When MPI is used as a communication mechanism within a CPR library, such libraries must offer fault-tolerant extensions with minimal detection, isolation, mitigation, and potential recovery semantics to aid the CPR library's fail-backward. Communication between MPI and the checkpoint library regarding system health may…
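The fail-backward idea mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not taken from the paper and uses no MPI. It assumes checkpoints are JSON files carrying a SHA-256 checksum, and the names `write_checkpoint`, `load_valid`, and `fail_backward` are hypothetical. The sketch shows only one ingredient: on restart, a checkpoint damaged by a failure during the write is detected, and recovery falls back to the newest earlier checkpoint that still verifies.

```python
import hashlib
import json
import os
import tempfile


def write_checkpoint(directory, step, state):
    """Write `state` for `step` with a checksum; return the final path."""
    payload = json.dumps(state, sort_keys=True)
    record = {
        "step": step,
        "payload": payload,
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }
    final_path = os.path.join(directory, f"ckpt_{step:08d}.json")
    # Write to a temporary file first so a crash never leaves a
    # half-written file under the final checkpoint name.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, final_path)
    return final_path


def load_valid(path):
    """Return (step, state) if the checkpoint verifies, otherwise None."""
    try:
        with open(path) as handle:
            record = json.load(handle)
        payload = record["payload"]
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest != record["sha256"]:
            return None
        return record["step"], json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Unreadable, truncated, or malformed files count as invalid.
        return None


def fail_backward(directory):
    """Scan checkpoints newest to oldest and return the first valid one."""
    names = sorted(
        (n for n in os.listdir(directory)
         if n.startswith("ckpt_") and n.endswith(".json")),
        reverse=True,
    )
    for name in names:
        result = load_valid(os.path.join(directory, name))
        if result is not None:
            return result
    return None  # No usable checkpoint: the job must start from scratch.
```

In a real MPI setting, every rank would also have to agree on which checkpoint to restore, and that agreement step is exactly where the fault-tolerant MPI extensions argued for above would be needed. The sketch leaves that part out.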
Taxonomy
Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Parallel Computing and Optimization Techniques
