Coded Computing for Fault-Tolerant Parallel QR Decomposition
Quang Minh Nguyen, Iain Weissburg, Haewon Jeong

TL;DR
This paper introduces a fault-tolerant parallel QR decomposition algorithm using coded computing, which enhances resilience against node failures with minimal overhead and guarantees orthogonality restoration.
Contribution
It constructs a checksum-generator matrix satisfying the post-orthogonalization condition and proposes an in-node systematic MDS coding strategy with minimal checksum requirements.
Findings
The proposed code satisfies the MDS property with high probability.
The in-node systematic MDS coding achieves the lower bound on the number of checksums.
Experiments show negligible overhead of the fault-tolerant framework.
Abstract
QR decomposition is an essential operation for solving linear equations and obtaining least-squares solutions. In high-performance computing systems, large-scale parallel QR decomposition often faces node faults. We address this issue by proposing a fault-tolerant algorithm that incorporates "coded computing" into the parallel Gram-Schmidt method, commonly used for QR decomposition. Coded computing introduces error-correcting codes into computational processes to enhance resilience against intermediate failures. While traditional coding strategies cannot preserve the orthogonality of Q, recent work has proven a post-orthogonalization condition that allows low-cost restoration of the degraded orthogonality. In this paper, we construct a checksum-generator matrix for multiple-node failures that satisfies the post-orthogonalization condition and prove that our code satisfies the…
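To make the checksum idea in the abstract concrete, here is a minimal single-process sketch in Python with NumPy. It is not the paper's construction: it does not implement the post-orthogonalization condition or the in-node systematic MDS code. It only illustrates the generic systematic-checksum principle, under these assumptions: each row of A stands in for the data held by one node, the checksum-generator matrix G is a random Gaussian matrix chosen purely for illustration, and the Gram-Schmidt inner products are computed from the data rows only, so the checksum rows ride along without affecting Q. The function names `coded_gram_schmidt` and `recover_rows` are hypothetical.

```python
import numpy as np

def coded_gram_schmidt(A, G):
    """Modified Gram-Schmidt on A, carrying checksum rows C = G @ Q
    through the same column updates (illustrative sketch only)."""
    n = A.shape[1]
    Q = A.astype(float).copy()
    C = G @ Q                      # checksum rows, initially G @ A
    R = np.zeros((n, n))
    for j in range(n):
        for i in range(j):
            # Projection coefficient uses data rows only (assumption).
            R[i, j] = Q[:, i] @ Q[:, j]
            Q[:, j] -= R[i, j] * Q[:, i]
            C[:, j] -= R[i, j] * C[:, i]   # same linear update on checksums
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        C[:, j] /= R[j, j]
    return Q, R, C

def recover_rows(Q_damaged, C, G, lost):
    """Rebuild the lost rows of Q from the surviving rows and the checksums,
    using G[:, lost] @ Q[lost] = C - G[:, kept] @ Q[kept]."""
    kept = np.setdiff1d(np.arange(G.shape[1]), lost)
    rhs = C - G[:, kept] @ Q_damaged[kept]
    Q_fixed = Q_damaged.copy()
    Q_fixed[lost] = np.linalg.lstsq(G[:, lost], rhs, rcond=None)[0]
    return Q_fixed

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))    # 8 "nodes", one row each (assumption)
G = rng.standard_normal((2, 8))    # 2 checksum rows, so up to 2 lost rows

Q, R, C = coded_gram_schmidt(A, G)

lost = np.array([1, 5])            # simulate two failed nodes
Q_damaged = Q.copy()
Q_damaged[lost] = 0.0
Q_rec = recover_rows(Q_damaged, C, G, lost)
```

Because every column update is linear, the checksum rows satisfy C = G Q throughout, and any set of lost rows no larger than the number of checksum rows can be recovered as long as the corresponding columns of G are linearly independent. In this sketch the inner products never touch the checksum rows, which a real distributed implementation may not be able to guarantee.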
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Error Correcting Code Techniques · Interconnection Networks and Systems
