Communication-avoiding Cholesky-QR2 for rectangular matrices
Edward Hutter, Edgar Solomonik

TL;DR
This paper presents a communication-avoiding parallel CholeskyQR2 algorithm for rectangular matrices, significantly reducing interprocessor communication and improving scalability on supercomputers for QR factorization tasks.
Contribution
It generalizes the parallelization of CholeskyQR2 to a tunable 3D processor grid, lowering interprocessor communication cost and outperforming ScaLAPACK's QR in the reported strong scaling tests.
Findings
Achieves a factor of Θ(P^(1/6)) less interprocessor communication on P processors than previous parallel QR implementations.
Faster than ScaLAPACK's QR by up to 3.3x on large-scale systems.
Demonstrates effective scaling on Intel Knights-Landing and Cray XE supercomputers.
Abstract
Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of Θ(P^(1/6)) less interprocessor communication on P processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024…
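For readers unfamiliar with the underlying kernel, the following is a minimal sequential sketch of CholeskyQR2 in NumPy. It is not the paper's 3D-grid parallel algorithm and involves no interprocessor communication; it only illustrates the numerical idea of applying Cholesky-QR twice. The function names `cholesky_qr` and `cholesky_qr2` are hypothetical, and the sketch assumes a tall matrix with full column rank that is conditioned well enough for the Cholesky factorization of its Gram matrix to succeed.

```python
import numpy as np


def cholesky_qr(a):
    """One Cholesky-QR pass: A = Q R with R from the Cholesky factor of A^T A."""
    gram = a.T @ a                      # n x n Gram matrix
    lower = np.linalg.cholesky(gram)    # gram = lower @ lower.T
    r = lower.T                         # upper-triangular factor
    # Q = A R^{-1}  <=>  lower @ Q^T = A^T, so solve for Q^T and transpose.
    q = np.linalg.solve(lower, a.T).T
    return q, r


def cholesky_qr2(a):
    """Two Cholesky-QR passes; the second pass repairs lost orthogonality."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, r2 @ r1                   # product of upper-triangular factors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((200, 8))   # tall-and-skinny test matrix
    q, r = cholesky_qr2(a)
    print("orthogonality error:", np.linalg.norm(q.T @ q - np.eye(8)))
    print("residual:", np.linalg.norm(q @ r - a))
```

A single pass can lose orthogonality roughly in proportion to the square of the condition number of the input, which is why the second pass is applied to the already nearly orthogonal `q1`. In a parallel setting the Gram matrix product and the triangular solve are the steps that would be distributed across the processor grid.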
