Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
Dane C. Lacey, Christie L. Alappat, Florian Lange, Georg Hager, Holger Fehske, Gerhard Wellein

TL;DR
This paper introduces a cache blocking method for distributed-memory parallel matrix power kernels, significantly improving performance for sparse matrix computations across modern architectures.
Contribution
The paper develops a distributed cache blocking technique that combines RACE's data reuse with explicit MPI communication, enabling efficient distributed-memory parallelization of the MPK.
Findings
Achieved up to 4x speed-up on 832 cores of an Intel Sapphire Rapids cluster.
Demonstrated substantial performance improvements across various scientific sparse matrices.
Extended RACE's cache reuse concept to distributed-memory systems with explicit communication.
Abstract
Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph-based formulation of the SpMV and employs a level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared-memory parallelization and thus…
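For orientation, the following is a minimal sketch of the baseline that the abstract describes: a matrix power kernel computed as back-to-back SpMVs, each of which streams the whole matrix again. It is not the paper's method and does not use RACE or MPI. The CSR layout, the function names `spmv_csr` and `naive_mpk`, and the small tridiagonal test matrix are assumptions chosen for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def naive_mpk(row_ptr, col_idx, values, x, power):
    """Return [A x, A^2 x, ..., A^power x] using back-to-back SpMVs.

    Every iteration traverses the full matrix arrays once, so the matrix
    is read `power` times in total. This repeated traffic is the reuse
    opportunity the abstract refers to.
    """
    results = []
    current = list(x)
    for _ in range(power):
        current = spmv_csr(row_ptr, col_idx, values, current)
        results.append(current)
    return results


# Hypothetical 3x3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] in CSR form.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

powers = naive_mpk(row_ptr, col_idx, values, [1.0, 1.0, 1.0], 2)
```

In this baseline, row i of the second product can only be computed once all entries of the first product that row i touches are available. That dependency is what a cache blocking scheme has to respect when it reorders the work so that matrix data is reused while still in cache.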
