TwinCG: Dual Thread Redundancy with Forward Recovery for Conjugate Gradient Methods
Kiril Dichev, Dimitrios S. Nikolopoulos

TL;DR
This paper introduces TwinCG, a fault-tolerant conjugate gradient solver using dual thread redundancy that offers improved fault detection and correction with minimal overhead, outperforming existing ABFT methods.
Contribution
The paper proposes TwinCG, a novel dual-thread redundancy approach for CG that enhances fault tolerance with low overhead and effective forward recovery capabilities.
Findings
TwinCG incurs only 5-6% runtime overhead before parallelization.
It reliably performs forward recovery in the presence of faults.
It outperforms state-of-the-art ABFT solutions in fault scenarios.
Abstract
Even though iterative solvers like the Conjugate Gradients method (CG) have been studied for over fifty years, fault tolerance for such solvers has attracted much attention only in recent years. For iterative solvers, two major reliable recovery strategies exist: checkpoint-restart for backward recovery, or some type of redundancy technique for forward recovery. Important redundancy techniques, such as algorithm-based fault tolerance (ABFT) for sparse matrix-vector products (SpMxV), have recently been proposed and increase the resilience of CG methods. These techniques offer limited recovery options and introduce a tolerable overhead. In this work, we study a more powerful resilience concept: redundant multithreading. It offers more generic and stronger recovery guarantees, covering any soft faults in CG iterations (including, among others, those covered by ABFT SpMxV), but it also requires more resources. We carefully study this…
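The general idea of redundant multithreading for CG can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cg_step`, `twin_step`, `redundant_cg`), the fault-injection flag, and the recovery policy (re-executing the current iteration from the last state both replicas agreed on, assuming the fault is transient) are all assumptions made for illustration, since the excerpt does not describe TwinCG's actual recovery scheme. Python threads are used only to mirror the dual-thread structure; they give no real parallel speedup here.

```python
import threading

def matvec(A, x):
    # Dense matrix-vector product standing in for SpMxV.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_step(A, x, r, p, flip=False):
    """One CG iteration. `flip` is a hypothetical hook that simulates
    a soft fault corrupting the matrix-vector product."""
    q = matvec(A, p)
    if flip:
        q[0] += 1.0  # injected fault (illustration only)
    rr = dot(r, r)
    alpha = rr / dot(p, q)
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    r_new = [ri - alpha * qi for ri, qi in zip(r, q)]
    beta = dot(r_new, r_new) / rr
    p_new = [ri + beta * pi for ri, pi in zip(r_new, p)]
    return x_new, r_new, p_new

def twin_step(A, state, fault_in_replica=None):
    """Run the same iteration on two threads and return both results."""
    results = [None, None]

    def run(i):
        results[i] = cg_step(A, *state, flip=(fault_in_replica == i))

    threads = [threading.Thread(target=run, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def redundant_cg(A, rhs, tol=1e-10, max_iter=100, fault_at=None):
    """CG with dual redundancy. Returns (solution, number of detected faults)."""
    state = ([0.0] * len(rhs), list(rhs), list(rhs))  # x, r, p
    detected = 0
    for k in range(max_iter):
        if dot(state[1], state[1]) ** 0.5 < tol:
            break
        inject = 0 if k == fault_at else None
        twin_a, twin_b = twin_step(A, state, inject)
        while twin_a != twin_b:
            # Mismatch means a fault was detected. Assumed policy: redo
            # this iteration from the last agreed state, with no
            # checkpoint reload and no restart of the solve.
            detected += 1
            twin_a, twin_b = twin_step(A, state)
        state = twin_a
    return state[0], detected
```

For example, `redundant_cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], fault_at=1)` injects one fault in the second iteration; the mismatch between the two replicas is detected, the iteration is repeated, and the solver still converges to the fault-free solution. Comparing replicas detects a fault anywhere in the iteration, not only in the matrix-vector product, which is the sense in which redundancy is more generic than ABFT for SpMxV alone.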
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Matrix Theory and Algorithms · Interconnection Networks and Systems
