Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI
Yao Xu, Gene Cooperman

TL;DR
This paper introduces a novel, low-overhead, network-agnostic method for transparent checkpointing of MPI programs that efficiently handles collective operations without additional network traffic or code modifications.
Contribution
The work presents a new approach for checkpointing MPI applications that avoids drawbacks of existing solutions, working entirely above the network layer and supporting non-blocking collectives.
Findings
Low runtime overhead demonstrated in experiments
No additional network traffic required
Supports non-blocking collective operations
Abstract
Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem is made more difficult when supporting collective operations across processes, such as barrier, reduce operations, scatter and gather, etc. Some processes may have reached the barrier or other collective operation, while other processes wait a long time to reach that same barrier or collective operation. At least two solutions are well-known in the literature: (i) draining in-flight network messages and then freezing the network at checkpoint time; and (ii) adding a barrier prior to the collective operation, and either completing the operation or aborting the barrier if not all processes are present. Both solutions suffer important…
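The difficulty described in the abstract, where some processes have already entered a collective operation while others have not yet reached it, can be illustrated with a toy model. The sketch below is not the paper's algorithm: it only assumes that each rank keeps a count of the collective calls it has entered on a single communicator, and it flags the ranks that are behind. The names `lagging_ranks` and `safe_to_checkpoint` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List


def lagging_ranks(entered: Dict[int, int]) -> List[int]:
    """Return the ranks that have entered fewer collectives than the furthest rank.

    `entered` maps rank -> number of collective calls that rank has entered
    on one communicator (a hypothetical bookkeeping scheme for this sketch).
    """
    furthest = max(entered.values())
    return sorted(rank for rank, count in entered.items() if count < furthest)


def safe_to_checkpoint(entered: Dict[int, int]) -> bool:
    """In this toy model, a snapshot is safe only if no rank is mid-collective,
    i.e. every rank has entered the same number of collective calls."""
    return not lagging_ranks(entered)


if __name__ == "__main__":
    # Rank 2 has not yet reached the third collective that ranks 0, 1, 3 entered.
    counts = {0: 3, 1: 3, 2: 2, 3: 3}
    print(lagging_ranks(counts))
    print(safe_to_checkpoint(counts))
```

In this model, a checkpoint request arriving while rank 2 lags would have to be deferred until rank 2 enters the same collective. How the paper itself makes that decision is not specified in the text above.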
Taxonomy
Topics: Distributed systems and fault tolerance · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
