Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
Yao Xu, Gene Cooperman

TL;DR
This paper presents a practical, efficient, and network-agnostic algorithm for transparent MPI checkpointing using topological sort, enabling resilience for long-running parallel applications with minimal overhead.
Contribution
It introduces a novel topological sort-based algorithm for MPI checkpointing that overcomes previous limitations related to network dependence and runtime overhead.
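For readers unfamiliar with the term, topological sorting orders the nodes of a dependency graph so that each node appears after everything it depends on. The sketch below is a generic illustration using Kahn's algorithm. It is not the paper's checkpointing algorithm. The operation names (`bcast_A`, `reduce_B`, `barrier_C`) and the function `topological_order` are hypothetical and chosen only for this example.

```python
from collections import deque


def topological_order(deps):
    """Return the nodes in dependency order using Kahn's algorithm.

    deps maps each node to the set of nodes that must come before it.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Collect every node, including ones that appear only as a prerequisite.
    nodes = set(deps)
    for preds in deps.values():
        nodes |= set(preds)

    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for node, preds in deps.items():
        for p in set(preds):
            successors[p].append(node)
            indegree[node] += 1

    # Start with nodes that have no prerequisites (sorted for determinism).
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in sorted(successors[n]):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    if len(order) != len(nodes):
        raise ValueError("cycle detected: no valid ordering exists")
    return order


# Hypothetical dependencies between three collective operations.
example_deps = {
    "reduce_B": {"bcast_A"},
    "barrier_C": {"bcast_A", "reduce_B"},
}
example_order = topological_order(example_deps)
```

Here `example_order` places `bcast_A` first, then `reduce_B`, then `barrier_C`, because each operation is emitted only after all of its prerequisites. How the paper applies topological ordering to MPI collectives is not described in the summary above.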
Findings
Achieves low runtime overhead in MPI checkpointing.
Demonstrates scalability on real-world applications including VASP.
Valid for both blocking and non-blocking MPI collectives.
Abstract
MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: finding a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, a decade ago, had two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's…
Taxonomy
Topics: Distributed systems and fault tolerance · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
