A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
Hossein Albakri, Kazem Cheshmi

TL;DR
This paper introduces a compiler transformation, enumerate-and-sparse-coarsen, that speeds up sparse matrix-matrix multiplication (SPMM) on GPUs by increasing data reuse in registers and caches and by balancing workloads, reaching up to 2.27x speedup over cuSPARSE.
Contribution
The paper presents a novel compiler transformation that enhances sparse matrix multiplication performance on GPUs, addressing irregular memory access issues.
Findings
Achieves up to 2.27x speedup over cuSPARSE.
Improves data reuse in registers and caches.
Creates more balanced workloads for GPU resources.
Abstract
Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to slow-downs due to inefficient use of computing and memory resources. This paper proposes a new compiler transformation, enumerate-and-sparse-coarsen, that accelerates sparse matrix-matrix multiplication (SPMM) on GPU devices. The transformation increases data reuse in registers and caches while creating more balanced workloads for GPU computing resources. The transformation is tested on sparse neural networks in convolutional and transformer models. On an A100 GPU and across a columns of matrix B (bCols) in …
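To make the idea of register-level data reuse in SPMM concrete, the following is a minimal CPU-side sketch in Python. It is not the paper's enumerate-and-sparse-coarsen transformation, whose details are not given in the abstract; it only illustrates generic coarsening over the columns of B, assuming A is stored in CSR format. All names (`spmm_csr`, `spmm_csr_coarsened`, `tile`, and the example matrices) are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: generic column coarsening for CSR SPMM on the CPU.
# This is NOT the paper's transformation; names and data are assumptions.

def spmm_csr(row_ptr, col_idx, vals, B, b_cols):
    """Baseline C = A @ B with A in CSR: one output element at a time."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * b_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(b_cols):
            acc = 0.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                # vals[p] and col_idx[p] are re-read for every column j
                acc += vals[p] * B[col_idx[p]][j]
            C[i][j] = acc
    return C


def spmm_csr_coarsened(row_ptr, col_idx, vals, B, b_cols, tile=4):
    """Coarsened variant: each nonzero of A is read once per tile of columns."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * b_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j0 in range(0, b_cols, tile):
            width = min(tile, b_cols - j0)
            acc = [0.0] * width  # stands in for per-thread registers on a GPU
            for p in range(row_ptr[i], row_ptr[i + 1]):
                a = vals[p]            # loaded once, reused `width` times
                b_row = B[col_idx[p]]  # one row of B feeds the whole tile
                for t in range(width):
                    acc[t] += a * b_row[j0 + t]
            C[i][j0:j0 + width] = acc
    return C


# Hypothetical 3x3 sparse A = [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in CSR form.
ROW_PTR = [0, 2, 3, 5]
COL_IDX = [0, 2, 2, 0, 1]
VALS = [1.0, 2.0, 3.0, 4.0, 5.0]
# Dense 3x5 matrix B, so bCols = 5 in this toy example.
B = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0, 10.0],
    [11.0, 12.0, 13.0, 14.0, 15.0],
]

C_BASE = spmm_csr(ROW_PTR, COL_IDX, VALS, B, 5)
C_TILED = spmm_csr_coarsened(ROW_PTR, COL_IDX, VALS, B, 5, tile=4)
```

Both functions compute the same product; the coarsened version differs only in reading each nonzero of A once per tile of columns rather than once per column, which is the kind of reuse the abstract refers to in general terms. On a GPU, the tile width would also influence how work is distributed across threads, but that mapping is outside the scope of this sketch.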
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Quantum Computing Algorithms and Architecture
