AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
Jie Liu, Huanzhi Pu, Zhiru Zhang

TL;DR
AsyncSparse introduces two GPU kernels that use asynchronous hardware features, namely the Tensor Memory Accelerator (TMA) and warp specialization, to accelerate sparse matrix-matrix multiplication (SpMM). Its WCSR kernel outperforms prior SpMM kernels on SuiteSparse matrices.
Contribution
The paper systematically studies how GPU asynchronous features affect SpMM performance and co-designs two kernels: a BCSR kernel for structured sparsity and a WCSR kernel for irregular sparsity.
Findings
The WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices.
The BCSR kernel achieves a 2.66x speedup on Qwen2.5-7B prefill at high sparsity.
A warp-specialized producer-consumer pipeline overlaps TMA data transfer with WGMMA computation.
Abstract
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices…
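To make the two ideas in the abstract concrete, here is a minimal CPU-side sketch in plain Python. It is not the paper's implementation: the real kernels run on the GPU and rely on TMA and WGMMA, which this code does not model. The first part shows a generic BCSR layout (block row pointers, block column indices, dense blocks) and a reference SpMM over it. The second part shows one plausible way to split large row-windows into bounded work items for load balancing. All function names, the square block size `bs`, and the `max_cols` chunk limit are hypothetical choices for illustration, and the actual WCSR layout is not specified in this summary.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_bcsr(a: Matrix, bs: int) -> Tuple[List[int], List[int], List[Matrix]]:
    """Pack a dense matrix into BCSR with square bs x bs blocks (illustrative)."""
    n_rows, n_cols = len(a), len(a[0])
    assert n_rows % bs == 0 and n_cols % bs == 0, "dimensions must be multiples of bs"
    row_ptr, col_idx, blocks = [0], [], []
    for br in range(n_rows // bs):
        for bc in range(n_cols // bs):
            block = [a[br * bs + i][bc * bs:(bc + 1) * bs] for i in range(bs)]
            # Only blocks containing at least one nonzero are stored.
            if any(v != 0 for row in block for v in row):
                col_idx.append(bc)
                blocks.append(block)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bcsr_spmm(row_ptr: List[int], col_idx: List[int], blocks: List[Matrix],
              b: Matrix, bs: int) -> Matrix:
    """Reference C = A @ B where A is given in BCSR and B is dense."""
    n_out_cols = len(b[0])
    c = [[0.0] * n_out_cols for _ in range((len(row_ptr) - 1) * bs)]
    for br in range(len(row_ptr) - 1):
        # Each stored block multiplies the matching bs rows of B.
        for k in range(row_ptr[br], row_ptr[br + 1]):
            bc, blk = col_idx[k], blocks[k]
            for i in range(bs):
                c_row = c[br * bs + i]
                for j in range(bs):
                    v = blk[i][j]
                    b_row = b[bc * bs + j]
                    for col in range(n_out_cols):
                        c_row[col] += v * b_row[col]
    return c


def split_row_windows(window_widths: List[int], max_cols: int) -> List[Tuple[int, int, int]]:
    """Split each row-window into (window, start, end) chunks of at most max_cols.

    Assumption: window_widths[w] is the number of packed nonzero columns in
    row-window w; each returned chunk could be assigned to one thread block.
    """
    work = []
    for w, width in enumerate(window_widths):
        for start in range(0, width, max_cols):
            work.append((w, start, min(start + max_cols, width)))
    return work
```

With this split, a window of width 12 and `max_cols=4` yields three work items instead of one long-running one, so no single thread block is stuck with an outsized window. The partial results for chunks of the same window would still need to be accumulated into the same output rows, a step this sketch leaves out.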
