Fault-tolerant linear solvers via selective reliability
Patrick G. Bridges, Kurt B. Ferreira, Michael A. Heroux, Mark Hoemmen

TL;DR
This paper introduces a fault-tolerant iterative linear solver that leverages selective reliability and a cross-layer framework to maintain convergence despite memory faults, reducing energy costs for reliable computing.
Contribution
It presents a novel cross-layer system and algorithms enabling iterative solvers to tolerate uncorrectable memory faults by selectively applying reliability, with demonstrated convergence in fault scenarios.
Findings
In the absence of faults, the solver performs as well as traditional methods
Under faults, it converges in cases where other solvers fail
The framework intercepts memory faults and reports them to the application
Abstract
Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These "fault-tolerant" iterative methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. Furthermore, they store…
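The idea of applying reliability selectively can be illustrated with a small inner-outer iteration. The sketch below is not the authors' algorithm or system interface; it is a minimal pure-Python example under stated assumptions. The outer loop (residual computation and the accept/reject test) is assumed to run in reliable mode, while the inner Jacobi sweeps are assumed to run in unreliable mode, with faults simulated by randomly scaling one entry of the correction. The names `unreliable_jacobi` and `selective_solve`, the fault model, and all parameter values are hypothetical.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm(v):
    """Euclidean norm."""
    return sum(vi * vi for vi in v) ** 0.5


def unreliable_jacobi(A, r, sweeps, fault_prob, rng):
    """Approximately solve A d = r with Jacobi sweeps in 'unreliable' mode.

    After each sweep, with probability fault_prob, one entry of d is
    silently corrupted (a simulated memory fault, assumed for illustration).
    """
    n = len(r)
    d = [0.0] * n
    for _ in range(sweeps):
        d = [
            (r[i] - sum(A[i][j] * d[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if rng.random() < fault_prob:
            d[rng.randrange(n)] *= 1e6  # simulated fault
    return d


def selective_solve(A, b, tol=1e-10, max_outer=200, fault_prob=0.2, seed=0):
    """Outer iteration assumed reliable; inner solves may be corrupted.

    Returns (x, converged, outer_iterations). The converged flag is the
    explicit failure indication when the tolerance is not reached.
    """
    rng = random.Random(seed)
    x = [0.0] * len(b)
    r = list(b)
    rnorm = norm(r)
    target = tol * norm(b)
    for it in range(max_outer):
        if rnorm <= target:
            return x, True, it
        d = unreliable_jacobi(A, r, 5, fault_prob, rng)
        x_new = [xi + di for xi, di in zip(x, d)]
        # Reliable phase: recompute the true residual and accept the
        # correction only if it actually reduces the residual norm.
        r_new = [bi - yi for bi, yi in zip(b, matvec(A, x_new))]
        rnorm_new = norm(r_new)
        if rnorm_new < rnorm:
            x, r, rnorm = x_new, r_new, rnorm_new
    return x, rnorm <= target, max_outer


# Small diagonally dominant test system with known solution [1, 2, 3, 4].
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
b = matvec(A, [1.0, 2.0, 3.0, 4.0])
x, converged, iters = selective_solve(A, b)
print(converged, iters, x)
```

In this sketch a corrupted inner solve costs only a wasted outer iteration, because the reliable residual check rejects any correction that does not reduce the residual. Raising `fault_prob` increases the number of rejected iterations, which is one simple way to see convergence degrade gradually rather than fail outright.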
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
