A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
Nils Kohl, Johannes Hötzer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde

TL;DR
This paper introduces a scalable, diskless checkpointing scheme for massively parallel simulations. It targets efficient recovery and resilience on future exascale systems and is demonstrated with large-scale physics simulations.
Contribution
The paper presents a novel distributed checkpointing method that is scalable, diskless, and integrated with MPI, suitable for exascale supercomputers and large-scale simulations.
Findings
Checkpoint creation takes only a few seconds at large scale.
The scheme scales almost perfectly to more than 260,000 processes.
Its effectiveness is demonstrated with large-scale physics simulations.
Abstract
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to billion computational cells executing on more than billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up…
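The idea of a diskless, distributed checkpoint can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how snapshots are distributed, so the ring-neighbor pairing in `buddy_of`, the `Process` class, and the function names are all hypothetical. A real code would exchange snapshots between ranks with MPI rather than by copying Python lists.

```python
import copy


class Process:
    """One simulated rank holding a block of the partitioned domain (hypothetical)."""

    def __init__(self, rank, cells):
        self.rank = rank
        self.cells = list(cells)     # live simulation data
        self.own_snapshot = None     # in-memory copy of this rank's data
        self.buddy_snapshot = None   # in-memory copy of a partner rank's data


def buddy_of(rank, size):
    # Assumed pairing rule: each rank stores its backup on its ring neighbor.
    return (rank + 1) % size


def create_checkpoint(procs):
    """Snapshot every rank in memory and mirror the snapshot on its buddy."""
    size = len(procs)
    for p in procs:
        p.own_snapshot = copy.deepcopy(p.cells)
    for p in procs:
        holder = procs[buddy_of(p.rank, size)]
        holder.buddy_snapshot = copy.deepcopy(p.own_snapshot)


def recover(procs, failed_rank):
    """Rebuild a failed rank from its buddy and roll all ranks back."""
    size = len(procs)
    holder = procs[buddy_of(failed_rank, size)]
    # The replacement rank starts from the copy kept by the buddy.
    procs[failed_rank] = Process(failed_rank, holder.buddy_snapshot)
    # Surviving ranks return to their own last snapshot.
    for p in procs:
        if p.rank != failed_rank:
            p.cells = copy.deepcopy(p.own_snapshot)
    # Re-create the checkpoint so that every snapshot is mirrored again.
    create_checkpoint(procs)


if __name__ == "__main__":
    procs = [Process(r, [r * 10 + i for i in range(3)]) for r in range(4)]
    create_checkpoint(procs)
    for p in procs:                      # advance the "simulation" one step
        p.cells = [c + 1 for c in p.cells]
    recover(procs, failed_rank=2)        # rank 2 is lost after the step
    print([p.cells for p in procs])
```

Because each snapshot lives only in the memory of two ranks, no file system is involved. Under this pairing the sketch tolerates the loss of one rank, but not the simultaneous loss of a rank and its buddy.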
