Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC
Madan Timalsina, Lisa Gerhardt, Nicholas Tyler, Johannes P. Blaschke, William Arndt

TL;DR
This paper evaluates DMTCP-based checkpoint-restart mechanisms in HPC environments, both inside and outside containers, and reports improved efficiency and reliability for complex, long-running scientific computations on NERSC's Perlmutter supercomputer.
Contribution
It provides an in-depth analysis of DMTCP's effectiveness in HPC container environments, highlighting practical improvements and implementation strategies.
Findings
DMTCP enhances checkpoint-restart efficiency in HPC applications.
Containerization with Shifter and Podman-HPC ensures consistent performance.
The approach improves reliability of long-running scientific computations.
Abstract
This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different…
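The checkpoint-restart idea described in the abstract can be illustrated with a minimal sketch. Note the assumptions: this is application-level checkpointing written for illustration, not DMTCP, which checkpoints unmodified processes transparently. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state format, and the `stop_after` parameter (a stand-in for a job hitting its wall-time limit) are all hypothetical and do not come from the paper.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Accumulate 0 + 1 + ... + (n_steps - 1), checkpointing periodically.

    stop_after (hypothetical) simulates an interruption after that many
    steps of work in this invocation.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    executed = 0
    while state["step"] < n_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # interrupted; work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path resumes from the last saved step rather than from zero, so only the work done after the most recent checkpoint is repeated. A transparent tool such as DMTCP aims to provide the same resume behavior without requiring the application to manage its own state file.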
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
