Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

Zijing Gu

arXiv:2007.13055 · cs.MS · July 28, 2020 · 1 citation

TL;DR

This paper implements and optimizes multiplication between dense and block-sparse matrices on CUDA using TVM. With TVM's automatic parameter tuning, the cross-thread-reduction-based implementation performs competitively with, or better than, other state-of-the-art frameworks.

Contribution

It uses TVM, a deep learning compiler, to explore the schedule space of dense-by-block-sparse matrix multiplication on CUDA and to generate efficient CUDA code, relying on TVM's automatic parameter tuning to select the schedule parameters.

Findings

01. Achieved performance competitive with or better than other state-of-the-art frameworks.

02. Demonstrated effective use of TVM's schedule space exploration and automatic parameter tuning.

03. Provided a scalable solution for block-sparse matrix operations on GPUs.

Abstract

We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
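As a rough illustration of the operation the abstract describes, the sketch below multiplies a dense matrix by a block-sparse one in plain Python. It is not the paper's TVM schedule or its generated CUDA code. The storage layout (the `data`, `indices`, `indptr` arrays of the common BSR convention), the choice to compute `y = x @ W.T`, and the function name `bsr_dense_matmul` are all assumptions made for the example. The innermost loop over `c` is the reduction axis; a cross-thread reduction would, in general terms, split that loop across GPU threads and combine the partial sums.

```python
def bsr_dense_matmul(x, data, indices, indptr, block_shape, n_out):
    """Compute y = x @ W.T where W (n_out x K) is stored in BSR form.

    Hypothetical helper for illustration only.
    x        : dense input, list of M rows of length K
    data     : list of nonzero blocks, each a bs_r x bs_c nested list
    indices  : block-column index of each stored block
    indptr   : blocks of block-row i are data[indptr[i]:indptr[i + 1]]
    """
    bs_r, bs_c = block_shape
    m = len(x)
    y = [[0.0] * n_out for _ in range(m)]
    for block_row in range(len(indptr) - 1):
        # Only the stored (nonzero) blocks of this block-row are visited.
        for p in range(indptr[block_row], indptr[block_row + 1]):
            block_col = indices[p]
            block = data[p]
            for i in range(m):
                for r in range(bs_r):
                    acc = 0.0
                    # Reduction axis: on a GPU this loop is the part a
                    # cross-thread reduction would parallelize.
                    for c in range(bs_c):
                        acc += x[i][block_col * bs_c + c] * block[r][c]
                    y[i][block_row * bs_r + r] += acc
    return y


# W is 4 x 4 with 2 x 2 blocks; only the two diagonal blocks are stored.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
indices = [0, 1]
indptr = [0, 1, 2]
x = [[1, 1, 1, 1], [1, 0, 2, 0]]

y = bsr_dense_matmul(x, data, indices, indptr, (2, 2), 4)
print(y)  # [[3.0, 7.0, 11.0, 15.0], [1.0, 3.0, 10.0, 14.0]]
```

Because the zero blocks are never stored or visited, the work scales with the number of nonzero blocks rather than with the full size of `W`, which is the property a block-sparse kernel exploits.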

Code & Models

Repositories

ceruleangu/Block-Sparse-Benchmark (tf, Official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Sparse and Compressive Sensing Techniques