AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

Jie Liu; Huanzhi Pu; Zhiru Zhang

arXiv:2604.17834 · cs.DC · April 21, 2026

TL;DR

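AsyncSparse introduces two GPU SpMM kernels, a BCSR kernel for structured sparsity and a WCSR kernel for irregular sparsity, that exploit asynchronous hardware features such as TMA and warp specialization. The WCSR kernel outperforms prior SpMM kernels on SuiteSparse matrices, and the BCSR kernel reaches a 2.66x speedup on Qwen2.5-7B prefill at high sparsity.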

Contribution

The paper systematically studies how GPU asynchronous features (TMA and warp specialization) affect SpMM performance and co-designs two kernels: a BCSR kernel for structured sparsity and a WCSR kernel for irregular sparsity.

Findings

01. The WCSR kernel outperforms prior SpMM kernels on SuiteSparse matrices.

02. The BCSR kernel achieves a 2.66x speedup on Qwen2.5-7B prefill at high sparsity.

03. A warp-specialized producer-consumer pipeline overlaps TMA data transfer with WGMMA computation.

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices…
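
To make the two ideas in the abstract concrete, here is a minimal CPU reference sketch in plain Python. It is not the paper's implementation: `bcsr_spmm` shows the standard BCSR layout (block-row pointers, block-column indices, dense tiles) and the per-tile multiply that a Tensor Core instruction would perform on a GPU, while `split_windows` is a hypothetical helper illustrating how a large row-window's work could be cut into bounded chunks so that no single thread block gets a disproportionate share. All function names, argument names, and the chunking policy are assumptions made for illustration.

```python
# Illustrative sketch only: names and layouts are assumptions, not the paper's API.

def bcsr_spmm(row_ptr, col_idx, blocks, B, bs, n_cols):
    """Compute C = A @ B where A is stored in BCSR with square bs x bs blocks.

    row_ptr[i]..row_ptr[i+1] index the nonzero blocks of block-row i,
    col_idx[k] is the block-column of block k, and blocks[k] is a dense
    bs x bs tile. B is a dense matrix with n_cols columns (list of rows).
    """
    n_block_rows = len(row_ptr) - 1
    C = [[0.0] * n_cols for _ in range(n_block_rows * bs)]
    for bi in range(n_block_rows):
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]
            tile = blocks[k]
            # Dense tile times the matching bs rows of B.
            for r in range(bs):
                out = C[bi * bs + r]
                for c in range(bs):
                    a = tile[r][c]
                    row_b = B[bj * bs + c]
                    for j in range(n_cols):
                        out[j] += a * row_b[j]
    return C


def split_windows(window_sizes, max_chunk):
    """Split each row-window's work into chunks of at most max_chunk units.

    Returns (window_id, start, end) work items. Chunks from the same window
    produce partial results that must later be summed into the same output rows.
    """
    items = []
    for w, size in enumerate(window_sizes):
        for start in range(0, size, max_chunk):
            items.append((w, start, min(start + max_chunk, size)))
    return items


# Example: a 4x4 matrix A stored as three 2x2 blocks.
# A = [[1, 2, 0, 1],
#      [3, 4, 1, 0],
#      [0, 0, 5, 6],
#      [0, 0, 7, 8]]
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
blocks = [[[1, 2], [3, 4]], [[0, 1], [1, 0]], [[5, 6], [7, 8]]]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]

C = bcsr_spmm(row_ptr, col_idx, blocks, B, bs=2, n_cols=2)
work = split_windows([5, 2, 9], max_chunk=4)
```

In this sketch each work item returned by `split_windows` could be assigned to its own thread block. How the paper's WCSR kernel actually chooses split points and combines partial results is not described in the abstract, so the policy above should be read only as one plausible scheme.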
