FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Jinliang Shi; Shigang Li; Youxuan Xu; Rongtian Fu; Xueying Wang; Tong Wu

arXiv:2412.11007·cs.DC·December 17, 2024

Open Access

TL;DR

FlashSparse is a novel method that reduces computation redundancy and enhances the performance of sparse matrix multiplications on Tensor Cores by minimizing sparse granularity and optimizing data access.
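To make the link between sparse granularity and redundant work concrete, here is a minimal sketch. It assumes a tile-based kernel groups nonzeros into g x 1 column vectors and must process every element of any vector that holds at least one nonzero. The function name `padded_elements`, the toy coordinates, and the granularities 16 and 8 are illustrative assumptions, not values taken from this summary.

```python
def padded_elements(coords, g):
    """Count the elements touched when nonzeros are grouped into g x 1 vectors.

    coords: iterable of (row, col) positions of nonzeros.
    g: hypothetical vector height (the sparse granularity).
    """
    # A vector is identified by its row window (row // g) and its column.
    vectors = {(r // g, c) for r, c in coords}
    return g * len(vectors)


# Toy sparsity pattern: 6 nonzeros spread over 16 rows and 4 columns.
coords = [(0, 0), (1, 3), (5, 3), (9, 1), (12, 0), (15, 2)]

for g in (16, 8):
    total = padded_elements(coords, g)
    print(f"g={g}: {total} elements processed, {total - len(coords)} are padding zeros")
```

On this toy pattern the coarser granularity processes 64 elements and the finer one 40, for the same 6 nonzeros. The gap is the kind of redundancy a smaller granularity is meant to cut.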

Contribution

FlashSparse introduces a swap-and-transpose matrix multiplication strategy and memory-efficient data mappings that speed up SpMM and SDDMM on Tensor Core accelerators.
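The summary does not spell out the mechanics of swap-and-transpose. One plausible reading is that it rests on the identity (A x B)^T = B^T x A^T: swapping the operands and transposing them yields the transposed result, so the sparse operand can occupy whichever side of the hardware tile has the smaller dimension. The sketch below only checks that identity on small integer matrices. The helper names are hypothetical and nothing here models Tensor Core tiles.

```python
def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Plain dense matrix product A x B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def swap_and_transpose_matmul(A, B):
    """Compute A x B indirectly as (B^T x A^T)^T."""
    Ct = matmul(transpose(B), transpose(A))  # operands swapped and transposed
    return transpose(Ct)                     # undo the transpose on the result


A = [[0, 2, 0], [1, 0, 0], [0, 0, 3], [0, 4, 0]]  # 4 x 3, mostly zeros
B = [[1, 2], [3, 4], [5, 6]]                      # 3 x 2, dense

assert matmul(A, B) == swap_and_transpose_matmul(A, B)
```

Both paths give the same product. In this reading, the swap changes only which operand maps to which tile dimension, not the result.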

Findings

01 Achieves a 5.5x speedup over DTC-SpMM

02 Achieves a 3.22x speedup over RoDe

03 Sets a new state of the art for sparse matrix multiplication (SpMM and SDDMM) on GPUs

Abstract

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior computing power, which is promising to boost the performance of matrix operators to a higher level. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. Benefiting from the minimum sparse granularity, the computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. Besides,…
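For readers unfamiliar with the two operators, the following reference sketch shows what each one computes, using a sparse matrix in CSR form (`indptr`, `indices`, `data`). It is a plain scalar definition for illustration, not the FlashSparse kernel. The function names are hypothetical, and the SDDMM variant shown (dot products of rows of `X` and rows of `Y`, evaluated only at the stored positions) is one common formulation assumed here.

```python
def spmm_csr(indptr, indices, data, B):
    """SpMM: C = A x B, with A sparse (CSR) and B dense."""
    n_rows, n_cols = len(indptr) - 1, len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], data[p]       # nonzero A[i][k] = a
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C


def sddmm_csr(indptr, indices, X, Y):
    """SDDMM: out[p] = dot(X[i], Y[k]) for each stored position (i, k)."""
    out = []
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]
            out.append(sum(x * y for x, y in zip(X[i], Y[k])))
    return out


# 2 x 3 sparse matrix: row 0 has nonzeros at columns 0 and 2, row 1 at column 1.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2 dense
X = [[1.0, 0.0], [0.0, 1.0]]               # 2 x 2 dense
Y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2 dense

print(spmm_csr(indptr, indices, data, B))
print(sddmm_csr(indptr, indices, X, Y))
```

SpMM produces a dense output, while SDDMM produces one value per stored position of the sparse pattern. In both, the irregular `indices` accesses are what make the work hard to map onto fixed-shape matrix units.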

Taxonomy

Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Algorithms and Data Compression