Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Guyue Huang; Haoran Li; Minghai Qin; Fei Sun; Yufei Ding; Yuan Xie

arXiv:2203.05016·cs.DC·March 15, 2022·1 cites

Open Access

TL;DR

This paper introduces Shfl-BW, a weight pruning pattern that lets sparse DNN inference use GPU tensor-cores efficiently, with reported speedups of up to 4.18x at 75% sparsity and minimal accuracy loss.

Contribution

The paper proposes Shfl-BW, a flexible sparse pattern that balances accuracy and efficiency, along with optimized GPU kernels for tensor-core acceleration of DNNs.

Findings

01 Achieves up to 4.18x speedup on GPU layers with 75% sparsity.

02 Accelerates Transformer layers with minimal accuracy loss.

03 Demonstrates state-of-the-art speed-accuracy trade-offs in GPU DNN inference.

Abstract

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while…
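The two baseline patterns the abstract contrasts, random (unstructured) sparsity and block-wise sparsity, can be made concrete with a small sketch. The following is only an illustration under stated assumptions, not the authors' method: it uses magnitude-based selection, a toy 8x8 matrix, and a 4x4 block size, none of which come from the paper, and it does not implement Shfl-BW itself. The helper names (`unstructured_mask`, `blockwise_mask`, `density`, `kept_magnitude`) are hypothetical.

```python
import random

def unstructured_mask(weights, sparsity):
    """Keep the individually largest-magnitude weights (assumed criterion)."""
    rows, cols = len(weights), len(weights[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    coords.sort(key=lambda rc: abs(weights[rc[0]][rc[1]]), reverse=True)
    keep = round(len(coords) * (1 - sparsity))
    mask = [[0] * cols for _ in range(rows)]
    for r, c in coords[:keep]:
        mask[r][c] = 1
    return mask

def blockwise_mask(weights, sparsity, block):
    """Keep whole block x block tiles, ranked by summed magnitude.

    Assumes both matrix dimensions are divisible by `block`.
    """
    rows, cols = len(weights), len(weights[0])
    tiles = [(br, bc) for br in range(0, rows, block)
                      for bc in range(0, cols, block)]

    def tile_score(tile):
        br, bc = tile
        return sum(abs(weights[r][c])
                   for r in range(br, br + block)
                   for c in range(bc, bc + block))

    tiles.sort(key=tile_score, reverse=True)
    keep = round(len(tiles) * (1 - sparsity))
    mask = [[0] * cols for _ in range(rows)]
    for br, bc in tiles[:keep]:
        for r in range(br, br + block):
            for c in range(bc, bc + block):
                mask[r][c] = 1
    return mask

def density(mask):
    """Fraction of weights that survive pruning."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))

def kept_magnitude(weights, mask):
    """Total |w| retained; a crude stand-in for model quality, not accuracy."""
    return sum(abs(w) * m
               for w_row, m_row in zip(weights, mask)
               for w, m in zip(w_row, m_row))

rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]

unstructured = unstructured_mask(weights, sparsity=0.75)
blockwise = blockwise_mask(weights, sparsity=0.75, block=4)

print("density:", density(unstructured), density(blockwise))
print("kept |w| (unstructured):", kept_magnitude(weights, unstructured))
print("kept |w| (block-wise):  ", kept_magnitude(weights, blockwise))
```

At equal sparsity, the unstructured mask always retains at least as much total magnitude as the block-wise mask, because it chooses the top weights without any layout constraint. The block-wise mask, in exchange, leaves dense tiles that map naturally onto matrix-shaped instructions. Retained magnitude is only a rough proxy here; it says nothing definitive about accuracy after fine-tuning.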

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques