Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Zijing Gu

TL;DR
This paper implements and optimizes multiplication of dense matrices by block-sparse matrices on CUDA using TVM. With TVM's automatic parameter tuning, a cross-thread-reduction-based implementation performs competitively with, or better than, other state-of-the-art frameworks.
Contribution
A TVM-based implementation of dense-by-block-sparse matrix multiplication on CUDA, built on cross-thread reduction, with schedule parameters selected by TVM's automatic tuning.
Findings
Achieved competitive or superior performance compared to existing frameworks.
Demonstrated effective use of TVM's schedule space exploration and auto-tuning.
Provided a scalable solution for block-sparse matrix operations on GPUs.
Abstract
We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread-reduction-based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
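To make the operation in the abstract concrete, here is a minimal pure-Python sketch of a dense-by-block-sparse product, Y = X · Wᵀ, where W is stored block-by-block. It is not the paper's implementation. The BSR layout (`data`, `indices`, `indptr`, as in SciPy's `bsr_matrix`), the choice of Y = X · Wᵀ, and the helper names `bsr_from_dense`, `tree_reduce`, and `dense_bsr_matmul` are all assumptions made for illustration. `tree_reduce` only mimics the pairwise order in which cooperating threads could combine partial sums; it does not model actual CUDA execution or anything TVM generates.

```python
# Hypothetical reference sketch: Y = X @ W^T with W in BSR form.
# Pure Python, no TVM or CUDA; names and layout are illustrative assumptions.

def bsr_from_dense(w, bs_r, bs_c):
    """Convert a dense N x K matrix (list of lists) to BSR, keeping nonzero blocks only."""
    n, k = len(w), len(w[0])
    assert n % bs_r == 0 and k % bs_c == 0
    data, indices, indptr = [], [], [0]
    for bi in range(n // bs_r):
        for bj in range(k // bs_c):
            block = [row[bj * bs_c:(bj + 1) * bs_c]
                     for row in w[bi * bs_r:(bi + 1) * bs_r]]
            if any(v != 0 for row in block for v in row):
                data.append(block)      # the bs_r x bs_c block values
                indices.append(bj)      # block-column index of this block
        indptr.append(len(indices))     # end of this block row in `indices`
    return data, indices, indptr


def tree_reduce(partials):
    """Sum partial products pairwise, halving the number of live values each step."""
    vals = list(partials)
    if not vals:
        return 0.0
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        vals = [vals[i] + vals[i + half] if i + half < len(vals) else vals[i]
                for i in range(half)]
    return vals[0]


def dense_bsr_matmul(x, data, indices, indptr, bs_r, bs_c):
    """Compute Y[i][j] = sum_k X[i][k] * W[j][k], visiting only stored blocks of W."""
    m = len(x)
    n = (len(indptr) - 1) * bs_r
    y = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for bi in range(len(indptr) - 1):
            for r in range(bs_r):
                partials = []
                for p in range(indptr[bi], indptr[bi + 1]):
                    k0 = indices[p] * bs_c
                    for c in range(bs_c):
                        partials.append(x[i][k0 + c] * data[p][r][c])
                y[i][bi * bs_r + r] = tree_reduce(partials)
    return y


# Small usage example with 2 x 2 blocks; two of the four blocks of W are all zero.
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 5, 6]]
DATA, INDICES, INDPTR = bsr_from_dense(W, 2, 2)
Y = dense_bsr_matmul(X, DATA, INDICES, INDPTR, 2, 2)
```

Only the stored blocks contribute multiply-adds, which is where a block-sparse kernel saves work over a dense one. In a GPU kernel, the inner reduction over `partials` is the part that would be split across threads and then combined.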
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Sparse and Compressive Sensing Techniques
