Efficient N-to-M Checkpointing Algorithm for Finite Element Simulations
David A. Ham, Vaclav Hapla, Matthew G. Knepley, Lawrence Mitchell, Koki Sagiyama

TL;DR
This paper presents a novel N-to-M checkpointing algorithm for finite element simulations that enhances efficiency and flexibility in saving/loading data across different process counts, demonstrated on large-scale supercomputing systems.
Contribution
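The paper introduces a new N-to-M checkpointing algorithm that lets data saved on N parallel processes be loaded on a different number, M, of processes. It is implemented in PETSc and exposed through a high-level interface in Firedrake for large-scale finite element simulations.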
Findings
Efficiently saved and loaded 8.2 billion degrees of freedom.
Evaluated using 8,192 processes on the ARCHER2 supercomputer.
Demonstrated improved flexibility in simulation restart and post-processing.
Abstract
In this work, we introduce a new algorithm for N-to-M checkpointing in finite element simulations. This new algorithm allows efficient saving/loading of functions representing physical quantities associated with the mesh representing the physical domain. Specifically, the algorithm allows for using different numbers of parallel processes for saving and loading, allowing for restarting and post-processing on the process count appropriate to the given phase of the simulation and other conditions. For demonstration, we implemented this algorithm in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface into Firedrake, a system for solving partial differential equations using finite element methods. We evaluated our new implementation by saving and loading data involving 8.2 billion finite element degrees of freedom using 8,192…
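The core idea of loading on a different process count than was used for saving can be illustrated with a toy sketch. This is not the paper's algorithm, which also has to handle the mesh topology and is implemented in PETSc and Firedrake. It only assumes that each degree of freedom has a global number, that processes are simulated as plain Python lists, and that the "file" is a single globally ordered array. The helper names `partition`, `save`, and `load` are hypothetical and chosen for illustration.

```python
def partition(num_dofs, num_procs):
    """Split global DoF indices 0..num_dofs-1 into contiguous per-process chunks."""
    base, extra = divmod(num_dofs, num_procs)
    chunks, start = [], 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def save(local_values, ownership):
    """Write each process's values into one globally numbered array (the 'file')."""
    num_dofs = sum(len(owned) for owned in ownership)
    stored = [None] * num_dofs
    for owned, values in zip(ownership, local_values):
        for global_index, value in zip(owned, values):
            stored[global_index] = value
    return stored


def load(stored, ownership):
    """Each loading process reads only the global indices it owns."""
    return [[stored[g] for g in owned] for owned in ownership]


# Save on N = 4 simulated processes.
num_dofs = 10
save_layout = partition(num_dofs, 4)
values_on_n = [[float(g) ** 2 for g in owned] for owned in save_layout]
checkpoint = save(values_on_n, save_layout)

# Load on M = 3 simulated processes with a different partition.
load_layout = partition(num_dofs, 3)
values_on_m = load(checkpoint, load_layout)
```

Because the stored array is indexed by global number rather than by saving rank, the loading side is free to choose its own partition. In a real parallel setting the same role would be played by parallel I/O and communication rather than in-memory lists.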
Taxonomy
Topics: Simulation Techniques and Applications · Distributed and Parallel Computing Systems · Scientific Computing and Data Management
