A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Hossein Albakri; Kazem Cheshmi

arXiv:2506.15174 · cs.PL · June 19, 2025

TL;DR

This paper introduces a compiler transformation, enumerate-and-sparse-coarsen, that speeds up sparse matrix-matrix multiplication (SPMM) on GPUs by increasing data reuse in registers and caches and by balancing workloads, reaching up to 2.27x speedup over cuSPARSE.

Contribution

The paper presents a novel compiler transformation that enhances sparse matrix multiplication performance on GPUs, addressing irregular memory access issues.

Findings

01. Achieves up to 2.27x speedup over cuSPARSE.

02. Improves data reuse in registers and caches.

03. Creates more balanced workloads for GPU resources.

Abstract

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to slow-downs due to inefficient use of computing and memory resources. This paper proposes a new compiler transformation, enumerate-and-sparse-coarsen, that accelerates sparse matrix-matrix multiplication (SPMM) on GPU devices. The transformation increases data reuse in registers and caches while creating more balanced workloads for GPU computing resources. The transformation is tested on sparse neural networks in convolutional and transformer models. On an A100 GPU and across columns of matrix B (bCols) in $A \times B = C$ …
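
To make the $A \times B = C$ setting concrete, here is a minimal reference sketch of SPMM with a sparse A and a dense B. It is not the paper's enumerate-and-sparse-coarsen transformation. It assumes A is stored in CSR format (the abstract does not name a storage format), and the names `spmm_csr`, `row_ptr`, `col_idx`, `vals`, and `b_cols` are hypothetical, chosen for illustration. The indirect lookup `B[k]` is the kind of data-dependent, irregular memory access the abstract describes.

```python
def spmm_csr(row_ptr, col_idx, vals, B, b_cols):
    """Reference SPMM sketch: C = A x B with A in CSR (assumed) and B dense."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * b_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Walk only the stored nonzeros of row i of A.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k = col_idx[p]  # data-dependent index: which row of B is read
            a = vals[p]
            # Sweep across the columns of B (bCols in the abstract).
            for j in range(b_cols):
                C[i][j] += a * B[k][j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]  stored as CSR
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = spmm_csr(row_ptr, col_idx, vals, B, b_cols=2)
print(C)  # [[11.0, 14.0], [15.0, 18.0], [19.0, 28.0]]
```

In this small example, rows 0 and 1 of A both read row 2 of B, which hints at the reuse opportunity the abstract mentions, and the rows hold different numbers of nonzeros, which illustrates the workload imbalance.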

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Quantum Computing Algorithms and Architecture