The Performance of Low-Synchronization Variants of Reorthogonalized Block Classical Gram-Schmidt
Erin Carson, Yuxin Ma

TL;DR
This paper evaluates low-synchronization variants of the block classical Gram-Schmidt algorithm for distributed systems, demonstrating significant speedups and recommending the most stable variants for practical QR factorization.
Contribution
It provides a performance comparison of recent low-synchronization BCGS variants, highlighting the stability and efficiency of BCGSI+P-1S and BCGSI+P-2S in distributed environments.
Findings
BCGSI+P-1S achieves up to 4x speedup over the classical reorthogonalized BCGS algorithm.
BCGSI+P-2S achieves up to 2x speedup over the same baseline.
Both variants perform similarly to other one-synchronization and two-synchronization variants while being more stable.
Abstract
Numerous applications, such as Krylov subspace solvers, make extensive use of the block classical Gram-Schmidt (BCGS) algorithm and its reorthogonalized variants for orthogonalizing a set of vectors. For large-scale problems in distributed memory settings, the communication cost, particularly the global synchronization cost, is a major performance bottleneck. In recent years, many low-synchronization BCGS variants have been proposed in an effort to reduce the number of synchronization points. The work [E. Carson, Y. Ma, arXiv preprint 2411.07077] recently proposed stable one-synchronization and two-synchronization variants of BCGS, i.e., BCGSI+P-1S and BCGSI+P-2S. In this work, we evaluate the performance of BCGSI+P-1S and BCGSI+P-2S on a distributed memory system compared to other well-known low-synchronization BCGS variants. In comparison to the classical reorthogonalized BCGS…
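To make the synchronization cost concrete, here is a minimal single-process NumPy sketch of block classical Gram-Schmidt with one reorthogonalization pass per block. It is an illustration of the general technique, not the BCGSI+P-1S or BCGSI+P-2S algorithms from the paper. The function name `bcgs_reorth`, the block size parameter `s`, and the use of `np.linalg.qr` for the intra-block factorization are assumptions made for this example. The comments mark the steps that would each need a global reduction if the rows of X were distributed across processes, assuming the intra-block QR costs one reduction.

```python
import numpy as np

def bcgs_reorth(X, s):
    """Illustrative reorthogonalized block classical Gram-Schmidt.

    X is an (m, n) matrix with full column rank and n divisible by s.
    Returns Q (m, n) with orthonormal columns and upper triangular R (n, n)
    such that X is approximately Q @ R.
    """
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(0, n, s):
        cols = slice(k, k + s)
        prev = Q[:, :k]                  # previously computed basis blocks

        # First pass: project out the previous blocks.
        S1 = prev.T @ X[:, cols]         # inner products (global reduction 1)
        W = X[:, cols] - prev @ S1
        U, T1 = np.linalg.qr(W)          # intra-block QR (global reduction 2)

        # Second pass: reorthogonalize against the previous blocks.
        S2 = prev.T @ U                  # inner products (global reduction 3)
        U = U - prev @ S2
        Qk, T2 = np.linalg.qr(U)         # intra-block QR (global reduction 4)

        # Combine both passes: X_k = prev (S1 + S2 T1) + Qk (T2 T1).
        Q[:, cols] = Qk
        R[:k, cols] = S1 + S2 @ T1
        R[cols, cols] = T2 @ T1
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 12))
    Q, R = bcgs_reorth(X, s=4)
    print("loss of orthogonality:", np.linalg.norm(Q.T @ Q - np.eye(12)))
    print("relative residual:", np.linalg.norm(Q @ R - X) / np.linalg.norm(X))
```

In this sketch every block step contains four reduction points. Low-synchronization variants reorganize the computation so that fewer such points remain per block, which is where the reported speedups on distributed memory systems come from.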
