On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Grzegorz Kwasniewski; Marko Kabić; Tal Ben-Nun; Alexandros Nikolaos Ziogas; Jens Eirik Saethre; André Gaillard; Timo Schneider; Maciej Besta; Anton Kozhevnikov; Joost VandeVondele; Torsten Hoefler

arXiv:2108.09337·cs.DC·April 26, 2023


TL;DR

This paper introduces near-communication-optimal algorithms for Cholesky and LU factorizations that move significantly less data than existing libraries and outperform them on large-scale parallel systems.

Contribution

It develops a theoretical framework for parallel I/O lower bounds and presents new algorithms for Cholesky and LU factorizations that are near-optimal in communication.

Findings

01 Empirical results match the theoretical predictions of reduced communication.

02 The code outperforms the Intel MKL, SLATE, CANDMC, and CAPITAL libraries in almost all tests.

03 Achieves up to three times faster solutions on large matrices.

Abstract

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating $N^3/(P\sqrt{M})$ elements per processor, where $M$ is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested…
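
To get a feel for the communication volume quoted in the abstract, the following minimal sketch evaluates N^3/(P*sqrt(M)) for a given matrix dimension N, processor count P, and local memory size M (in elements). The function name and the example values are assumptions chosen purely for illustration; they are not taken from the paper or its code, and the sketch only evaluates the stated expression rather than modelling the actual schedules.

```python
import math


def comm_elements_per_processor(n: int, p: int, m: int) -> float:
    """Evaluate N^3 / (P * sqrt(M)), the per-processor communication
    volume (in matrix elements) stated in the abstract.

    n: matrix dimension N
    p: number of processors P
    m: local memory size M, in elements
    """
    if n <= 0 or p <= 0 or m <= 0:
        raise ValueError("n, p, and m must all be positive")
    return n**3 / (p * math.sqrt(m))


# Hypothetical configuration (illustrative values, not from the paper):
# a 16384 x 16384 matrix on 1024 processors, each holding 2**26 elements.
base = comm_elements_per_processor(n=16384, p=1024, m=2**26)

# Quadrupling the local memory halves the volume, since it scales as 1/sqrt(M).
more_memory = comm_elements_per_processor(n=16384, p=1024, m=2**28)

print(f"elements per processor: {base:.0f}")
print(f"ratio after 4x memory:  {more_memory / base:.2f}")
```

The 1/sqrt(M) dependence is the reason a 2.5D decomposition trades extra local memory for reduced data movement: under this expression, every fourfold increase in M halves the number of elements each processor communicates.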
