A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method
Massimiliano Fasi, Julien Langou, Yves Robert, Bora Ucar

TL;DR
This paper proposes a novel fault-tolerance approach for the preconditioned conjugate gradient method by combining checkpointing with algorithm-based fault tolerance (ABFT) for error detection and correction, enabling forward recovery.
Contribution
It introduces a recovery scheme that integrates ABFT with checkpointing for iterative solvers. When a single error is detected, ABFT corrects it in place (forward recovery), so no rollback or re-execution is needed.
Findings
ABFT can detect silent errors in iterative solvers and can correct a single detected error.
The performance model helps compare different fault-tolerance schemes.
Simulations validate the effectiveness of the proposed approach.
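As an illustration of how checksum-based ABFT can both detect and correct an error, here is a minimal sketch for a dense matrix-vector product. It uses the classic two-checksum idea (a plain and a weighted column checksum) and is not the paper's exact scheme: it assumes at most one corrupted entry, located in the output vector y, and all function names (`column_checksums`, `verify_and_correct`) and the tolerance are hypothetical choices made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def column_checksums(A):
    """Precompute two checksum rows of A: plain (weights 1) and weighted (weights i+1)."""
    n_cols = len(A[0])
    plain = [sum(row[j] for row in A) for j in range(n_cols)]
    weighted = [sum((i + 1) * row[j] for i, row in enumerate(A))
                for j in range(n_cols)]
    return plain, weighted

def verify_and_correct(plain, weighted, x, y, tol=1e-9):
    """Check y = A x against the checksums (assumption: at most one bad entry in y).

    Returns (status, y_out) with status 'ok', 'corrected', or 'uncorrectable'.
    """
    # For a correct y, both discrepancies vanish up to rounding.
    d1 = sum(y) - dot(plain, x)
    d2 = sum((i + 1) * yi for i, yi in enumerate(y)) - dot(weighted, x)
    threshold = tol * max(1.0, max(abs(v) for v in y))
    if abs(d1) <= threshold and abs(d2) <= threshold:
        return "ok", list(y)
    if abs(d1) <= threshold:
        return "uncorrectable", list(y)
    # A single error e at index k gives d1 = e and d2 = (k + 1) * e.
    position = d2 / d1
    k = round(position) - 1
    if abs(position - round(position)) > 1e-6 or not 0 <= k < len(y):
        return "uncorrectable", list(y)
    fixed = list(y)
    fixed[k] -= d1  # forward recovery: repair in place, no rollback
    return "corrected", fixed

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]
plain, weighted = column_checksums(A)

y = matvec(A, x)
status_clean, _ = verify_and_correct(plain, weighted, x, y)

y_faulty = list(y)
y_faulty[1] += 0.5  # simulated silent error in one output entry
status_faulty, y_fixed = verify_and_correct(plain, weighted, x, y_faulty)
```

With a single checksum the error would only be detected; the second, weighted checksum is what makes it possible to locate the corrupted entry. When the status is 'uncorrectable', a solver would have to fall back to rollback from a checkpoint.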
Abstract
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every d × c iterations. When a silent error is detected by the verification mechanism, one can roll back to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (with neither rollback nor re-execution) when a single error is detected. We introduce an abstract performance…
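The backward-recovery pattern described in the abstract (verify periodically, checkpoint after a fixed number of verifications, roll back when verification fails) can be sketched as follows. This is a toy driver under stated assumptions, not the paper's implementation: the names `run_with_rollback`, `verify_every`, and `verifs_per_checkpoint` are hypothetical, the "solver" is a running-sum iteration with an exactly checkable invariant standing in for a real stability or checksum test, and a single fault is injected once by hand.

```python
import copy

def run_with_rollback(step, verify, corrupt, state, n_iters,
                      verify_every, verifs_per_checkpoint, fault_at=None):
    """Iterate `step`, verifying every `verify_every` iterations and
    checkpointing after every `verifs_per_checkpoint` successful verifications.
    A failed verification restores the last checkpoint (backward recovery)."""
    checkpoint = (0, copy.deepcopy(state))  # the initial state is the first checkpoint
    log = []
    it = 0
    verifs_since_checkpoint = 0
    while it < n_iters:
        state = step(state)
        it += 1
        if fault_at is not None and it == fault_at:
            state = corrupt(state)  # simulated silent error, injected once
            fault_at = None
        if it % verify_every == 0:
            if not verify(state):
                it, saved = checkpoint
                state = copy.deepcopy(saved)
                verifs_since_checkpoint = 0
                log.append(("rollback", it))
                continue
            verifs_since_checkpoint += 1
            if verifs_since_checkpoint == verifs_per_checkpoint:
                checkpoint = (it, copy.deepcopy(state))
                verifs_since_checkpoint = 0
                log.append(("checkpoint", it))
    return state, log

# Toy iteration: state (k, s) with invariant s == 1 + 2 + ... + k.
def step(state):
    k, s = state
    return (k + 1, s + k + 1)

def verify(state):
    k, s = state
    return s == k * (k + 1) // 2

def corrupt(state):
    k, s = state
    return (k, s + 100)

final_state, log = run_with_rollback(
    step, verify, corrupt, state=(0, 0), n_iters=12,
    verify_every=2, verifs_per_checkpoint=3, fault_at=9)
```

In this run the fault injected at iteration 9 is only caught at the next verification (iteration 10), and the work since the checkpoint at iteration 6 is redone. Forward recovery with ABFT aims to avoid exactly this re-execution when a single error can be corrected in place.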
Taxonomy
Topics: Matrix Theory and Algorithms · Parallel Computing and Optimization Techniques · Advanced Optimization Algorithms Research
