Fault Tolerant QR Factorization for General Matrices
Camille Coti

TL;DR
This paper introduces a fault-tolerant QR factorization algorithm for general matrices that uses communication-avoiding techniques and redundancy to recover from process failures without significant overhead.
Contribution
It proposes a novel fault-tolerant QR algorithm that recovers the state of a failed process from data held by a single other process, while adding no significant operation to the critical path during failure-free execution.
Findings
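The state of a failed process can be recovered from the data held by a single surviving process.
No significant operation is added to the critical path during failure-free execution.
The redundancy needed for recovery comes from the structure of the reduction used in each part of the computation.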
Abstract
This paper presents a fault-tolerant algorithm for the QR factorization of general matrices. It builds on the communication-avoiding algorithm and uses the structure of the reduction in each part of the computation to introduce redundancies that are sufficient to recover the state of a failed process. After a process has failed, its state can be recovered from the data held by a single process. In addition, the algorithm adds no significant operation to the critical path during failure-free execution.
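To make the idea of redundancy inside a reduction concrete, here is a minimal single-machine sketch in Python. It is not the paper's algorithm: it only simulates a tree reduction of R factors over one tall-and-skinny panel, in which paired processes exchange their R factors and both compute the combined result, so each intermediate state exists on two processes. The function names (qr_r, redundant_tsqr, recover), the power-of-two process count, the full-column-rank blocks, and the use of modified Gram-Schmidt are all assumptions made for illustration.

```python
import math


def qr_r(rows):
    """Return the R factor of a full-column-rank matrix (list of rows).

    Uses modified Gram-Schmidt, so the diagonal of R is positive.
    """
    m, n = len(rows), len(rows[0])
    # Column-major working copy of the input.
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / R[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(q, cols[k])]
    return R


def redundant_tsqr(blocks):
    """Simulate a tree reduction in which both partners keep the result.

    blocks[i] is the row block owned by simulated process i; the number
    of blocks is assumed to be a power of two. Returns the per-level
    states, where history[level][rank] is the R factor held by `rank`.
    """
    p = len(blocks)
    state = [qr_r(b) for b in blocks]      # level 0: local factorization
    history = [list(state)]
    step = 1
    while step < p:
        new_state = []
        for rank in range(p):
            buddy = rank ^ step            # partner at this tree level
            lo, hi = min(rank, buddy), max(rank, buddy)
            # Both partners stack the two R factors in the same order,
            # so they compute the same combined R.
            new_state.append(qr_r(state[lo] + state[hi]))
        state = new_state
        history.append(list(state))
        step *= 2
    return history


def recover(history, level, failed_rank):
    """Rebuild the state of a failed process from its partner only.

    Valid for level >= 1; level 0 holds purely local data in this sketch.
    """
    buddy = failed_rank ^ (1 << (level - 1))
    return [row[:] for row in history[level][buddy]]


blocks = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]],
    [[2.0, 1.0], [0.0, 3.0], [4.0, 1.0]],
    [[1.0, 0.0], [2.0, 5.0], [3.0, 3.0]],
    [[6.0, 1.0], [1.0, 1.0], [0.0, 2.0]],
]
history = redundant_tsqr(blocks)
restored = recover(history, level=1, failed_rank=2)
print(restored == history[1][2])
```

In this sketch the exchange replaces the one-way send of a plain reduction, so the extra copy comes from work the partner performs concurrently rather than from an additional step on the critical path. A real distributed implementation would also have to recover the data needed to rebuild Q and the trailing matrix, which this example leaves out.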
Taxonomy
Topics: Distributed systems and fault tolerance · Interconnection Networks and Systems · Petri Nets in System Modeling
