Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Haisha Zhao; San Li; Jiaheng Wang; Chunbao Zhou; Jue Wang; Zhikuang Xin; Shunde Li; Zhiqiang Liang; Zhijie Pan; Fang Liu; Yan Zeng; Yangang Wang; Xuebin Chi

arXiv:2501.09251·cs.DC·January 17, 2025

TL;DR

Acc-SpMM is a GPU library that significantly accelerates sparse matrix-matrix multiplication by leveraging Tensor Cores and various optimizations, outperforming existing solutions across multiple NVIDIA GPU architectures.

Contribution

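The paper introduces Acc-SpMM, a high-performance SpMM library for Tensor Cores that combines data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing, achieving substantial speedups over cuSPARSE.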

Findings

01. Achieves up to 5.11x speedup on RTX 4090

02. Averages 2.52x speedup on RTX 4090

03. Outperforms cuSPARSE across diverse GPU architectures

Abstract

General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM acceleration. However, fully exploiting this hardware requires systematic optimization. In this paper, we propose Acc-SpMM, a high-performance SpMM library on TCs with multiple optimizations, including data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing. Compared with state-of-the-art SpMM kernels across several NVIDIA GPU architectures and a diverse range of benchmark matrices, Acc-SpMM achieves significant performance improvements: on average 2.52x (up to 5.11x) speedup on RTX 4090, on average 1.91x (up to 4.68x) speedup on A800, and on…
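
For readers unfamiliar with the kernel, the sketch below shows what SpMM computes: C = A × B, where A is sparse and B is dense. It is a plain-Python reference only, assuming A is stored in the standard CSR layout; it is not the compressed format, pipeline, or Tensor Core kernel proposed in the paper, and the function name `csr_spmm` is hypothetical.

```python
def csr_spmm(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, with A in CSR form and B a dense list of rows.

    indptr, indices, data are the usual CSR arrays (assumed layout, not the
    paper's format). B must have as many rows as A has columns.
    """
    n_rows = len(indptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i of A.
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]        # column index of the nonzero in A
            a = data[k]           # value of the nonzero
            row_b = B[j]
            row_c = C[i]
            # Accumulate a scaled copy of row j of B into row i of C.
            for n in range(n_cols):
                row_c[n] += a * row_b[n]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]] stored as CSR
indptr = [0, 2, 2, 3]
indices = [0, 2, 1]
data = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = csr_spmm(indptr, indices, data, B)
print(C)
```

The work per row depends on how many nonzeros that row holds (here two, zero, and one), which illustrates why uneven sparsity makes load balancing a concern for GPU implementations.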
