Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM
Yehonatan Fridman, Yaniv Snir, Harel Levin, Danny Hendler, Hagit Attiya, Gal Oren

TL;DR
This paper proposes an improved recovery mechanism for distributed iterative linear solvers in HPC systems using non-volatile RAM, reducing overhead and enhancing resilience compared to traditional methods.
Contribution
It introduces in-NVRAM ESR, a novel fault recovery approach that applies exact state reconstruction (ESR) using NVRAM and MPI One-Sided Communication to improve the efficiency and resilience of iterative solvers.
Findings
Reduces memory footprint compared to in-RAM ESR.
Decreases recovery time overhead.
Provides full resilience with NVRAM-based ESR.
Abstract
HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes. Mathematical solvers, and iterative linear solvers in particular, are key building blocks in numerous large-scale scientific applications. Consequently, supporting the recovery of distributed solvers is necessary for scaling scientific applications to exascale platforms. Previous recovery methods for iterative solvers are based on Checkpoint-Restart (CR), which incurs high fault tolerance overhead, or intrinsic fault tolerance, which requires extra computation time to converge after failures. Exact state reconstruction…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
