DMTCP Checkpoint/Restart of MPI Programs via Proxies
Gregory Michael Price

TL;DR
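This paper introduces a proxy-based method for checkpointing and restarting MPI programs that works regardless of the MPI implementation in use, so the checkpointer needs no knowledge of that implementation's underlying message-passing mechanisms. The same approach may also allow a program checkpointed on one MPI implementation to be restarted on another.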
Contribution
It adds a proxy to MPI programs so that they can be checkpointed and restarted independently of the specific MPI implementation, without relying on that implementation's internal message-passing mechanisms.
Findings
MPI programs can be checkpointed and restarted using proxies.
Proxies may enable a program checkpointed on one MPI implementation to be restarted on another.
Checkpointing no longer requires knowledge of the MPI implementation's underlying message-passing mechanisms.
Abstract
MPI accomplishes portable, standardized message-passing between processes by exposing a standard API that hides the implementation of the underlying mechanism for message passing. Until now, checkpointing an MPI program required knowledge of these underlying mechanisms. Through the addition of a proxy, we demonstrate that MPI programs can be checkpointed and restarted regardless of the MPI implementation utilized. Further, proxies may enable MPI programs to be checkpointed on one MPI implementation, and restarted on another.
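The proxy idea in the abstract can be illustrated with a toy sketch. This is not DMTCP and not the paper's implementation: `ToyBackendA`, `ToyBackendB`, `Proxy`, and `App` are hypothetical names, the two backends only stand in for different MPI implementations, and everything runs in one process (a real proxy would likely be a separate process). The sketch also assumes no messages are in flight when the checkpoint is taken. The point it tries to show is that the checkpoint contains only application state, so it can be resumed against a backend with a different internal layout.

```python
import pickle
from collections import deque


class ToyBackendA:
    """Hypothetical stand-in for one MPI implementation (dict of lists)."""
    name = "toy-A"

    def __init__(self):
        self._boxes = {}

    def send(self, dest, value):
        self._boxes.setdefault(dest, []).append(value)

    def recv(self, rank):
        return self._boxes[rank].pop(0)


class ToyBackendB:
    """Second stand-in with a different internal layout (deque of pairs)."""
    name = "toy-B"

    def __init__(self):
        self._queue = deque()

    def send(self, dest, value):
        self._queue.append((dest, value))

    def recv(self, rank):
        for i, (dest, value) in enumerate(self._queue):
            if dest == rank:
                del self._queue[i]
                return value
        raise LookupError("no message for rank %d" % rank)


class Proxy:
    """Owns all backend-specific state; the application sees only send/recv."""

    def __init__(self, backend):
        self.backend = backend

    def send(self, dest, value):
        self.backend.send(dest, value)

    def recv(self, rank):
        return self.backend.recv(rank)


class App:
    """Application side: its state never includes backend internals."""

    def __init__(self, rank, proxy):
        self.rank = rank
        self.step = 0
        self.total = 0
        self.proxy = proxy

    def run(self, until_step):
        while self.step < until_step:
            # Send a value to ourselves through the proxy, then receive it.
            self.proxy.send(self.rank, self.step + 1)
            self.total += self.proxy.recv(self.rank)
            self.step += 1

    def checkpoint(self):
        # Only application state is saved; the proxy and backend are excluded.
        state = {"rank": self.rank, "step": self.step, "total": self.total}
        return pickle.dumps(state)

    @classmethod
    def restart(cls, blob, proxy):
        state = pickle.loads(blob)
        app = cls(state["rank"], proxy)
        app.step = state["step"]
        app.total = state["total"]
        return app


def demo():
    app = App(0, Proxy(ToyBackendA()))
    app.run(3)                      # run three steps on backend A
    blob = app.checkpoint()         # assumes no messages are in flight
    resumed = App.restart(blob, Proxy(ToyBackendB()))
    resumed.run(5)                  # finish the remaining steps on backend B
    return resumed.total, resumed.proxy.backend.name
```

Calling `demo()` runs three steps against the first backend, checkpoints, and finishes the remaining two steps against the second. Keeping the backend behind `Proxy` is the design choice that makes the saved state backend-neutral in this sketch.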
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
