Exploiting Redundant Computation in Communication-Avoiding Algorithms for Algorithm-Based Fault Tolerance
Camille Coti

TL;DR
This paper proposes using the redundant computations that communication-avoiding algorithms already perform as a fault-tolerance mechanism, illustrated with QR factorization of tall and skinny matrices.
Contribution
It shows that the redundancy introduced to reduce inter-process communication can also serve to survive process failures, and demonstrates the idea on QR factorization of tall and skinny matrices.
Findings
Redundant computations can be exploited for fault tolerance.
The number of failures the method tolerates depends on how much redundancy is available.
The evaluation counts the failures the QR factorization algorithm can tolerate under different semantics.
Abstract
Communication-avoiding algorithms allow redundant computations to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault-tolerance purposes. We illustrate this idea with QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
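To make the idea in the abstract concrete, here is a minimal single-process sketch of how redundancy can arise in a tall-and-skinny QR reduction. It assumes a butterfly-style exchange in which paired ranks swap their R factors and both factor the stacked result; the abstract does not specify this scheme, so treat it as an illustration rather than the paper's algorithm. The names `r_factor` and `redundant_tsqr` are hypothetical, ranks are simulated as list entries, and the number of ranks is assumed to be a power of two.

```python
import numpy as np


def r_factor(block):
    """Return the R factor of `block`, normalized to a non-negative diagonal."""
    r = np.linalg.qr(block, mode="r")
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    # Flipping the sign of a row of R keeps R^T R unchanged, so this is still
    # a valid R factor; the normalization makes it unique for full-rank input.
    return signs[:, None] * r


def redundant_tsqr(blocks):
    """Simulate a butterfly-exchange TSQR over len(blocks) ranks.

    Assumption: each rank exchanges its R with a buddy and both factor the
    same stacked matrix, so both end up with the same intermediate R.
    Returns a list with, for each step, the R factor held by every rank.
    """
    p = len(blocks)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    held = [r_factor(b) for b in blocks]  # step 0: local QR on each rank
    history = [held]
    distance = 1
    while distance < p:
        nxt = []
        for rank in range(p):
            buddy = rank ^ distance
            lo, hi = min(rank, buddy), max(rank, buddy)
            # Both partners stack in the same order, hence compute the same R.
            nxt.append(r_factor(np.vstack([held[lo], held[hi]])))
        held = nxt
        history.append(held)
        distance *= 2
    return history


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 4))           # tall and skinny matrix
history = redundant_tsqr(np.split(a, 8))   # 8 simulated ranks, 8 rows each
final = history[-1]
```

In this sketch, after step s every intermediate R factor is held by a group of 2^s ranks, so the data for that step survives as long as one rank in each group is still alive. How many failures a real implementation tolerates depends on the failure semantics it assumes, which is what the paper evaluates.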
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Interconnection Networks and Systems
