Efficient Quantized Sparse Matrix Operations on Tensor Cores

Shigang Li; Kazuki Osawa; Torsten Hoefler

arXiv:2209.06979 · cs.DC · May 9, 2023 · 1 citation

TL;DR

Magicube is a high-performance library for sparse, low-precision integer matrix operations on Tensor cores. It speeds up sparse deep learning workloads while keeping accuracy comparable.

Contribution

The paper introduces Magicube, a sparse-matrix library for low-precision integers on Tensor cores. It supports SpMM and SDDMM, two major sparse operations in deep learning, with mixed precision.

Findings

01. Achieves a 1.44x average speedup over vendor libraries on an NVIDIA A100 GPU

02. Reaches up to a 2.37x speedup in experiments

03. Maintains comparable accuracy in sparse Transformer inference

Abstract

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to…
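To make the two operations named in the abstract concrete, here is a minimal reference sketch of what SpMM and SDDMM compute on quantized integers. It is not Magicube's API and uses no Tensor core data layout. The function names (`quantize_int8`, `spmm_csr`, `sddmm_csr`), the CSR storage format, and the symmetric int8 quantization scheme are all assumptions chosen for illustration.

```python
# Reference semantics only: plain Python, no GPU, no Tensor core layout.
# All names below are hypothetical and not taken from the Magicube library.

def quantize_int8(values, scale):
    """Symmetric quantization (assumed scheme): round(v / scale), clamped to int8."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def spmm_csr(row_ptr, col_idx, vals, dense):
    """SpMM: C = A @ B, with A sparse in CSR form and B a dense list of rows."""
    n_cols = len(dense[0])
    out = [[0] * n_cols for _ in range(len(row_ptr) - 1)]
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, b_row = vals[k], dense[col_idx[k]]
            for j in range(n_cols):
                # Python ints do not overflow; hardware would use a wider
                # accumulator (for example int32) for int8 products.
                out[i][j] += a * b_row[j]
    return out

def sddmm_csr(row_ptr, col_idx, lhs, rhs_t):
    """SDDMM: compute lhs @ rhs_t^T only at the positions of a sparse mask.

    The mask is given by row_ptr/col_idx; rhs_t holds one row per output column.
    Returns the sampled values in CSR order.
    """
    out = []
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            out.append(sum(x * y for x, y in zip(lhs[i], rhs_t[j])))
    return out

# A = [[2, 0, -1],
#      [0, 3,  0]] stored in CSR form.
row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [2, -1, 3]

spmm_result = spmm_csr(row_ptr, col_idx, vals, [[1, 2], [3, 4], [5, 6]])
sddmm_result = sddmm_csr(row_ptr, col_idx, [[1, 2], [3, 4]], [[1, 0], [0, 1], [1, 1]])
```

SpMM reads the nonzeros of a sparse matrix and produces a dense result, while SDDMM takes two dense inputs and produces only the entries selected by a sparse pattern. In sparse attention, the pattern is typically the attention mask.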

Code & Models

Repositories

shigangli/magicube (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Parallel Computing and Optimization Techniques

Methods: Lib · Attention Is All You Need · Linear Layer · Attention Dropout · Position-Wise Feed-Forward Layer · Cosine Annealing · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Adam