A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations

Nils Kohl; Johannes Hötzer; Florian Schornbaum; Martin Bauer; Christian Godenschwager; Harald Köstler; Britta Nestler; Ulrich Rüde

arXiv:1708.08286 · cs.DC · January 30, 2018 · Int. J. High Perform. Comput. Appl.

TL;DR

This paper introduces a scalable, distributed, diskless checkpointing scheme that creates and recovers snapshots of a partitioned simulation domain, aimed at making massively parallel simulations resilient to failures on future exascale systems.

Contribution

The paper presents a novel distributed checkpointing method that is scalable, diskless, and integrated with MPI, suitable for exascale supercomputers and large-scale simulations.

Findings

01

Checkpoint creation takes only a few seconds at large scale.

02

The scheme scales almost perfectly to more than 260,000 processes.

03

Demonstrated on large-scale physics simulations with up to 40 billion computational cells and more than 400 billion floating-point values.

Abstract

Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to 40 billion computational cells executing on more than 400 billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up…
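To make the idea of a distributed, diskless checkpoint concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a simple partner scheme in which every rank keeps an in-memory snapshot of its own block and a redundant copy of one other rank's block; the abstract does not state how the paper distributes its snapshots. The names `Rank`, `partner_of`, `create_checkpoint`, `fail`, and `recover` are hypothetical, and plain Python objects stand in for processes that would communicate through a message-passing library in a real code.

```python
from copy import deepcopy


class Rank:
    """One simulated process holding a block of the partitioned domain."""

    def __init__(self, rank_id, block):
        self.rank_id = rank_id
        self.block = block              # live simulation data
        self.own_snapshot = None        # in-memory copy of this rank's block
        self.partner_snapshot = None    # in-memory copy of another rank's block


def partner_of(rank_id, num_ranks):
    # Assumed pairing rule (hypothetical): shift by half the number of ranks.
    return (rank_id + num_ranks // 2) % num_ranks


def create_checkpoint(ranks):
    """Snapshot every block locally and store a redundant copy on its partner."""
    n = len(ranks)
    if n < 2:
        raise ValueError("need at least two ranks for a redundant copy")
    for r in ranks:
        r.own_snapshot = deepcopy(r.block)
    for r in ranks:
        # In a distributed code this copy would be a message to the partner.
        ranks[partner_of(r.rank_id, n)].partner_snapshot = deepcopy(r.block)


def fail(rank):
    """Simulate a process failure: everything held in its memory is lost."""
    rank.block = None
    rank.own_snapshot = None
    rank.partner_snapshot = None


def recover(ranks, failed_ids):
    """Roll all ranks back to the last checkpoint without touching a disk."""
    n = len(ranks)
    for rid in failed_ids:
        if partner_of(rid, n) in failed_ids:
            raise RuntimeError(f"rank {rid} and its partner both failed")
    for r in ranks:
        if r.rank_id in failed_ids:
            # The failed rank's data survives only on its partner.
            holder = ranks[partner_of(r.rank_id, n)]
            r.block = deepcopy(holder.partner_snapshot)
        else:
            # Survivors roll back to their own local snapshot.
            r.block = deepcopy(r.own_snapshot)
    # Re-create the checkpoint so the redundant copies exist again.
    create_checkpoint(ranks)


if __name__ == "__main__":
    ranks = [Rank(i, [float(i)] * 4) for i in range(4)]
    create_checkpoint(ranks)
    for r in ranks:                     # advance the simulation one step
        r.block = [x + 1.0 for x in r.block]
    fail(ranks[1])
    recover(ranks, {1})
    print([r.block for r in ranks])
```

In this sketch a checkpoint costs one local copy plus one transfer per rank, independent of the total number of ranks, which is one plausible reason a scheme of this kind can scale well. The sketch cannot recover if a rank and its partner fail at the same time.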
