Accelerating Sparse Approximate Matrix Multiplication on GPUs

Xiaoyan Liu; Yi Liu; Ming Dun; Bohong Yin; Hailong Yang; Zhongzhi Luan; Depei Qian

arXiv:2103.13042·cs.PF·October 25, 2022

TL;DR

This paper introduces cuSpAMM, a GPU-parallel algorithm for Sparse Approximate Matrix Multiply (SpAMM) on near-sparse matrices. It combines a thread-parallel algorithm redesign, blocking for memory access, and tensor core acceleration, scales across multiple GPUs with load balancing, and reports speedups over cuBLAS and cuSPARSE.

Contribution

The paper presents the first GPU-optimized parallel SpAMM algorithm with new performance optimizations and multi-GPU scaling capabilities.

Findings

01. Achieves significant speedup over cuBLAS and cuSPARSE.

02. Effectively scales across multiple GPUs with load balancing.

03. Demonstrates improved performance on real-world datasets.

Abstract

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fills the performance gap left by traditional optimizations for dense and sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit the performance potential of GPUs. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. We propose several performance optimizations, including an algorithm redesign that adapts to thread parallelism, blocking strategies that optimize memory access, and acceleration with tensor cores. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load balancing scheme. We evaluate cuSpAMM on both…
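The abstract does not spell out how SpAMM decides which work to drop. As the algorithm is generally described, the product of two sub-blocks is skipped when the product of their Frobenius norms falls below a tolerance. The following is a minimal single-threaded sketch of that rule in plain Python, assuming a flat tiling of square matrices; it is not cuSpAMM's GPU implementation, and the names `spamm_blocked`, `tile`, and `tau` are illustrative rather than taken from the paper.

```python
import math


def frobenius_norm(M, r0, c0, size):
    """Frobenius norm of the size x size tile of M whose top-left corner is (r0, c0)."""
    return math.sqrt(sum(M[r][c] ** 2
                         for r in range(r0, r0 + size)
                         for c in range(c0, c0 + size)))


def spamm_blocked(A, B, tile, tau):
    """Approximate C = A * B for n x n lists of lists (n divisible by tile).

    A tile product A[i,k] * B[k,j] is skipped when
    ||A[i,k]||_F * ||B[k,j]||_F < tau (assumed skipping rule).
    Returns (C, skipped_tile_products, total_tile_products).
    """
    n = len(A)
    nt = n // tile  # number of tiles per dimension

    # Precompute tile norms once so the skip test is a cheap scalar comparison.
    norm_a = [[frobenius_norm(A, i * tile, k * tile, tile) for k in range(nt)]
              for i in range(nt)]
    norm_b = [[frobenius_norm(B, k * tile, j * tile, tile) for j in range(nt)]
              for k in range(nt)]

    C = [[0.0] * n for _ in range(n)]
    skipped = 0
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                if norm_a[i][k] * norm_b[k][j] < tau:
                    skipped += 1  # contribution judged negligible
                    continue
                # Dense multiply-accumulate of one tile pair into C[i,j].
                for r in range(i * tile, (i + 1) * tile):
                    for c in range(j * tile, (j + 1) * tile):
                        acc = 0.0
                        for t in range(k * tile, (k + 1) * tile):
                            acc += A[r][t] * B[t][c]
                        C[r][c] += acc
    return C, skipped, nt ** 3
```

With `tau = 0` no tile product is skipped and the result is the exact product; raising `tau` trades accuracy for fewer tile multiplications. In a GPU setting, the `(i, j)` output tiles are independent of one another, which is the kind of structure a thread-parallel redesign can exploit.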

Taxonomy

Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques