CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

TL;DR
CRAFT is a C++ library that simplifies implementing application-level checkpointing and fault tolerance in HPC, reducing overhead and effort for fault recovery.
Contribution
It introduces an extendable, easy-to-use library that combines checkpointing and dynamic process recovery for high-performance computing applications.
Findings
Reduces implementation effort for application-level checkpointing
Supports asynchronous checkpointing and node-level checkpointing
Analyzes overheads thoroughly using multiple benchmarks
Abstract
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As…
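To make the idea of application-level CR concrete, the following is a minimal sketch in plain C++. It does not use CRAFT and does not reflect CRAFT's API: the class SimpleCheckpoint, its add/write/read methods, and the function run are hypothetical names chosen for illustration. The sketch assumes a single process, a POSIX file system (where std::rename replaces an existing file), and a restart on the same architecture, since data are stored as raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only; this is NOT CRAFT's interface.
// The application registers the variables that define its restart state.
class SimpleCheckpoint {
public:
    explicit SimpleCheckpoint(std::string path) : path_(std::move(path)) {}

    // One overload per supported data type; more types could be added the same way.
    void add(int* v) { ints_.push_back(v); }
    void add(std::vector<double>* v) { vecs_.push_back(v); }

    // Write to a temporary file first, then rename it over the old checkpoint,
    // so a failure during the write leaves the previous checkpoint intact.
    bool write() const {
        const std::string tmp = path_ + ".tmp";
        {
            std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
            if (!out) return false;
            for (const int* p : ints_)
                out.write(reinterpret_cast<const char*>(p), sizeof(int));
            for (const auto* v : vecs_) {
                const std::size_t n = v->size();
                out.write(reinterpret_cast<const char*>(&n), sizeof n);
                out.write(reinterpret_cast<const char*>(v->data()),
                          static_cast<std::streamsize>(n * sizeof(double)));
            }
            if (!out) return false;
        }  // stream is closed here, before the rename
        return std::rename(tmp.c_str(), path_.c_str()) == 0;
    }

    // Restore registered data; returns false if no usable checkpoint exists.
    bool read() const {
        std::ifstream in(path_, std::ios::binary);
        if (!in) return false;
        for (int* p : ints_)
            in.read(reinterpret_cast<char*>(p), sizeof(int));
        for (auto* v : vecs_) {
            std::size_t n = 0;
            in.read(reinterpret_cast<char*>(&n), sizeof n);
            if (!in) return false;
            v->resize(n);
            in.read(reinterpret_cast<char*>(v->data()),
                    static_cast<std::streamsize>(n * sizeof(double)));
        }
        return static_cast<bool>(in);
    }

private:
    std::string path_;
    std::vector<int*> ints_;
    std::vector<std::vector<double>*> vecs_;
};

// Toy iterative loop: adds 1.0 to every entry of x per iteration and writes a
// checkpoint every cp_freq iterations. If crash_at >= 0, the function returns
// early at that iteration to mimic a failure. Returns the iteration reached.
int run(const std::string& path, int n_iter, int cp_freq, int crash_at,
        std::vector<double>& x) {
    assert(cp_freq > 0);
    int iter = 0;
    SimpleCheckpoint cp(path);
    cp.add(&iter);
    cp.add(&x);
    cp.read();  // on restart, restores iter and x; otherwise leaves them as-is

    while (iter < n_iter) {
        if (iter == crash_at) return iter;  // simulated failure
        for (double& v : x) v += 1.0;
        ++iter;
        if (iter % cp_freq == 0) cp.write();
    }
    return iter;
}
```

Calling run once with a crash_at value and then again with fresh data and crash_at set to -1 resumes from the last written checkpoint rather than from iteration 0. Only the data the application registers is saved, which is why application-level CR tends to have lower overhead than saving the whole process image, at the cost of the implementation effort the abstract mentions.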
