The Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Erin Claire Carson, Jakub Hercík

TL;DR
This paper presents a novel algorithm-based method for detecting silent hardware errors in pipelined Krylov subspace methods, enhancing fault tolerance in large-scale computations by monitoring finite precision bounds.
Contribution
It introduces a finite precision error analysis technique to detect silent errors in pipelined Krylov methods and proposes a fault-tolerant variant with adaptive detection strategies.
Findings
Effective detection of silent errors demonstrated in numerical experiments
Adaptive detection criteria improve fault detection reliability
Enhanced fault tolerance in pipelined Krylov methods achieved
Abstract
As computational machines become larger and more complex, the probability of hardware failure rises. "Silent errors", or bit flips, may not be immediately apparent but can cause detrimental effects to algorithm behavior. In this work, we examine an algorithm-based approach to silent error detection in the context of pipelined Krylov subspace methods, in particular, Pipe-PR-CG, for the solution of linear systems. Our approach is based on using finite precision error analysis to bound the differences between quantities which should be equal in exact arithmetic. By monitoring select quantities during the iteration, we can detect when these bounds are violated, which indicates that a silent error has occurred. We use this approach to develop a fault-tolerant variant and also suggest a strategy for dynamically adapting the detection criteria. Our numerical experiments demonstrate the…
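The detection idea in the abstract (monitor the gap between two quantities that coincide in exact arithmetic, and flag an error when the gap exceeds a rounding-error bound) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: it uses plain conjugate gradient rather than Pipe-PR-CG, it monitors the gap between the recursively updated residual and the explicitly computed residual b - Ax, and the threshold (the `safety` factor and the form of `bound`) is a heuristic, not a bound derived in the paper. The names `cg_with_gap_check`, `safety`, and `fault_at` are hypothetical.

```python
import numpy as np

def cg_with_gap_check(A, b, iters=50, safety=100.0, fault_at=None):
    """Plain CG that flags an iteration where the residual gap is too large.

    Returns (x, k) where k is the iteration at which a violation was
    detected, or None if no violation occurred.
    """
    eps = np.finfo(float).eps
    norm_A = np.linalg.norm(A, 2)
    norm_b = np.linalg.norm(b)
    x = np.zeros_like(b)
    r = b.copy()          # recursively updated residual
    p = r.copy()
    rr = r @ r
    max_x = 0.0
    for k in range(iters):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if fault_at == k:
            r[0] += 1.0   # additive perturbation standing in for a bit flip
        max_x = max(max_x, np.linalg.norm(x))
        # In exact arithmetic b - A x equals r, so this gap would be zero.
        gap = np.linalg.norm((b - A @ x) - r)
        # Heuristic rounding-error allowance (assumed form, not from the paper).
        bound = safety * eps * (k + 1) * (norm_A * max_x + norm_b)
        if gap > bound:
            return x, k
        rr_new = r @ r
        if np.sqrt(rr_new) <= 1e-12 * norm_b:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, None

# Small, well-conditioned SPD test problem: tridiag(-1, 4, -1).
n = 50
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

x_clean, clean_hit = cg_with_gap_check(A, b)
x_fault, fault_hit = cg_with_gap_check(A, b, fault_at=3)
print("fault-free run flagged at:", clean_hit)
print("faulty run flagged at:", fault_hit)
```

Note that this sketch pays for an extra matrix-vector product per iteration to form b - Ax, which is the simplest pair of "should be equal" quantities to demonstrate but not necessarily the pair a pipelined method would monitor.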
Taxonomy
Topics: Matrix Theory and Algorithms
