Implementation-Oblivious Transparent Checkpoint-Restart for MPI
Yao Xu, Leonid Belyaev, Twinkle Jain, Derek Schafer, Anthony Skjellum, Gene Cooperman

TL;DR
This paper presents MANA, a single codebase that transparently checkpoints MPI applications across different MPI implementations, so developers can compile, test, and deploy a workload against any standards-compliant MPI implementation.
Contribution
The paper presents a novel platform, MANA, that provides implementation-oblivious checkpointing for MPI, facilitating cross-implementation testing and deployment.
Findings
Supports major MPI implementations transparently
Enables 'develop once, run everywhere' MPI workflows
Improves flexibility and testing of MPI applications
Abstract
This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads for major available MPI implementations: "develop once, run everywhere". The new platform enables application developers to compile their application against any of the available standards-compliant MPI implementations, and test each MPI implementation according to performance or other features.
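One way to picture what "implementation-oblivious" could mean is a table of stable virtual handles: the application only ever holds virtual IDs, and the checkpoint image stores how each object was created rather than the implementation-specific handle value. The Python sketch below is a toy illustration under that assumption. It is not MANA's code, and the names HandleTable, impl_a_create, and impl_b_create are hypothetical; the abstract does not describe the mechanism at this level.

```python
from typing import Any, Callable, Dict


class HandleTable:
    """Toy table mapping stable virtual handles to implementation-specific ones.

    Hypothetical illustration only; not taken from the MANA codebase.
    """

    def __init__(self) -> None:
        self._next_id = 1
        self._real: Dict[int, Any] = {}    # virtual id -> real handle
        self._recipe: Dict[int, str] = {}  # virtual id -> how to re-create it

    def register(self, recipe: str, real_handle: Any) -> int:
        """Record a newly created object and return its virtual handle."""
        vid = self._next_id
        self._next_id += 1
        self._real[vid] = real_handle
        self._recipe[vid] = recipe
        return vid

    def real(self, vid: int) -> Any:
        """Translate a virtual handle to the current real handle."""
        return self._real[vid]

    def checkpoint(self) -> Dict[int, str]:
        """Save only implementation-independent state (the recipes)."""
        return dict(self._recipe)

    def restart(self, saved: Dict[int, str], create: Callable[[str], Any]) -> None:
        """Rebuild real handles with `create`, keeping virtual ids unchanged."""
        self._recipe = dict(saved)
        self._real = {vid: create(recipe) for vid, recipe in saved.items()}
        self._next_id = max(saved, default=0) + 1


# Two stand-in "implementations" whose handle types differ (assumed for the demo).
def impl_a_create(recipe: str) -> int:
    return 1000 + len(recipe)        # integer-style handle


def impl_b_create(recipe: str) -> str:
    return f"ptr<{recipe}>"          # pointer-style handle


before = HandleTable()
world = before.register("world", impl_a_create("world"))
image = before.checkpoint()

after = HandleTable()
after.restart(image, impl_b_create)  # restart under a different implementation
```

In this sketch the application-visible value `world` is the same before and after the restart, while `after.real(world)` resolves to a handle produced by the second stand-in implementation.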
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
