ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
Lukas Hübner, Demian Hespe, Peter Sanders, Alexandros Stamatakis

TL;DR
ReStore is an in-memory data recovery framework for MPI applications. By keeping replicated copies of the data in memory instead of on a parallel file system, it recovers from process failures substantially faster than disk-based checkpointing.
Contribution
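The paper introduces ReStore, an algorithmic framework with a C++ library implementation for MPI programs. It enables faster recovery of data after process failures and supports shrinking recovery, in which the surviving processes take over the workload instead of relying on spare compute nodes.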
Findings
Recovery times of milliseconds on up to 24,576 processors.
Substantial speedup in recovery for a bioinformatics application.
Effective in both controlled and real-world environments.
Abstract
Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-failed processes reload data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in…
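The replication idea in the abstract can be illustrated with a small stand-alone sketch. It is not ReStore's API or its actual data distribution: the placement rule (copy k of a block goes to rank (block + k * stride) mod p) and the function names replica_ranks and surviving_holder are assumptions made up for this example. The sketch only shows why a surviving process can still find a copy of a block after some ranks fail.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical placement rule (an assumption, not ReStore's scheme):
// copy k of a block is stored on rank (block + k * stride) mod num_ranks,
// with stride = num_ranks / replicas, so the copies land on distinct ranks.
std::vector<int> replica_ranks(std::size_t block, int num_ranks, int replicas) {
    assert(replicas >= 1 && replicas <= num_ranks);
    const std::size_t p = static_cast<std::size_t>(num_ranks);
    const std::size_t stride = p / static_cast<std::size_t>(replicas);
    std::vector<int> ranks;
    for (int k = 0; k < replicas; ++k) {
        const std::size_t rank = (block + static_cast<std::size_t>(k) * stride) % p;
        ranks.push_back(static_cast<int>(rank));
    }
    return ranks;
}

// Returns the first surviving rank that holds a copy of the block,
// or std::nullopt if every copy was on a failed rank (data is lost).
std::optional<int> surviving_holder(std::size_t block, int num_ranks, int replicas,
                                    const std::vector<bool>& alive) {
    for (int rank : replica_ranks(block, num_ranks, replicas)) {
        if (alive[static_cast<std::size_t>(rank)]) {
            return rank;
        }
    }
    return std::nullopt;
}
```

With 8 ranks and 4 copies per block, block 3 would be held by ranks 3, 5, 7, and 1 under this rule, so it remains recoverable from memory unless all four of those ranks fail.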
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Scientific Computing and Data Management
