FT-GCR: a fault-tolerant generalized conjugate residual elliptic solver
Mike Gillard, Tommaso Benacchio

TL;DR
FT-GCR is a novel fault-tolerant iterative solver for elliptic problems that detects and recovers from hardware-induced soft faults, enhancing reliability in high-performance computing environments.
Contribution
The paper introduces FT-GCR, a new fault-tolerant Krylov solver that detects and recovers from soft faults during iterative solutions of elliptic equations.
Findings
Effective detection of bit-flips during iterations.
Successful recovery from soft faults with minimal performance loss.
Robustness demonstrated across various grid sizes and fault scenarios.
Abstract
With the steady advance of high-performance computing systems featuring ever smaller hardware components, the systems and algorithms used for numerical simulations increasingly contend with disruptions caused by hardware failures and bit-level misrepresentations of computing data. In numerical frameworks exploiting massive processing power, the solution of linear systems is often the most computationally intensive component. Because they repeat a large number of operations, iterative solvers are particularly vulnerable to bit-flips. A new method named FT-GCR is proposed here that equips the preconditioned Generalized Conjugate Residual Krylov solver with detection of, and recovery from, soft faults. The algorithm tests for a monotonic decrease of the residual norm and, if the test fails, restarts the iteration within the local Krylov space. Numerical experiments on…
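The detection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it is an unpreconditioned, restarted GCR in NumPy that relies on the residual norm of GCR being non-increasing in exact arithmetic, and treats any increase (or a non-finite value) as a suspected soft fault. The recovery shown here, which discards the step, recomputes the residual from the last accepted iterate, and clears the stored directions, is an assumption made for illustration, as are the names `gcr_monotone_check`, `fault_hook`, and `flip`, the slack factor `1e-8`, and the tridiagonal test matrix standing in for a discretized elliptic operator.

```python
import numpy as np

def gcr_monotone_check(A, b, tol=1e-10, max_iter=200, restart=10, fault_hook=None):
    """Restarted GCR with a residual-monotonicity fault test (illustrative sketch)."""
    x = np.zeros(b.shape, dtype=float)
    r = b - A @ x
    res_norm = np.linalg.norm(r)
    b_norm = np.linalg.norm(b) or 1.0
    P, AP = [], []          # stored directions p_j and their images A p_j
    faults = 0
    for it in range(max_iter):
        if res_norm <= tol * b_norm:
            break
        if len(P) >= restart:           # ordinary GCR(k) restart
            P, AP = [], []
        # New direction: residual, made A^T A-orthogonal to the stored ones.
        p = r.copy()
        Ap = A @ p
        for pj, Apj in zip(P, AP):
            beta = (Ap @ Apj) / (Apj @ Apj)
            p -= beta * pj
            Ap -= beta * Apj
        alpha = (r @ Ap) / (Ap @ Ap)    # minimizes ||r - alpha * Ap||
        x_new = x + alpha * p
        r_new = r - alpha * Ap
        if fault_hook is not None:
            fault_hook(it, r_new)       # hypothetical hook: may corrupt r_new in place
        new_norm = np.linalg.norm(r_new)
        # Fault test: the residual norm must not grow (small slack for rounding).
        if not np.isfinite(new_norm) or new_norm > res_norm * (1 + 1e-8):
            faults += 1
            P, AP = [], []              # assumed recovery: drop the Krylov directions
            r = b - A @ x               # rebuild the residual from the last accepted x
            res_norm = np.linalg.norm(r)
            continue
        x, r, res_norm = x_new, r_new, new_norm
        P.append(p)
        AP.append(Ap)
    return x, faults

# Diagonally dominant tridiagonal test system (assumed stand-in for an elliptic operator).
n = 64
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

def flip(it, vec):
    # Emulate one soft fault: corrupt a single residual entry at iteration 3.
    if it == 3:
        vec[0] += 1e6

x, faults = gcr_monotone_check(A, b, fault_hook=flip)
x_clean, faults_clean = gcr_monotone_check(A, b)
```

In this sketch the fault is injected into the updated residual vector, where it breaks monotonicity and is caught. A corruption of the product `Ap` before `alpha` is computed would not be caught by this test alone, because the projection step still reduces the recursively updated residual; a fuller scheme would need additional checks for that case.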
