Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

Paolo D'Alberto; Taehee Jeong; Akshai Jain; Shreyas Manjunath; Mrinal Sarmah; Samuel Hsu; Yaswanth Raparti; Nitesh Pipralia

arXiv:2407.09453 · cs.LG · July 16, 2024 · 1 citation

TL;DR

This paper introduces weight block sparsity for DNNs, a hardware-friendly structured sparsity that speeds up inference and reduces memory footprint while maintaining accuracy, demonstrated on multiple models and hardware configurations.

Contribution

It presents a vertical system that trains convolution and matrix multiplication weights for 8x8 block sparsity on a single GPU, with compiler support for data compaction and computation splitting, to speed up inference.

Findings

01. Halved model weights with minimal accuracy loss

02. Achieved 2x faster inference on ResNet50

03. Demonstrated hardware-software synergy for FPGA deployment

Abstract

Deep Neural Networks (DNNs) keep growing in size as they are developed, trained, and deployed. These networks require significant computational resources, straining both high-end and resource-limited devices. Our solution is weight block sparsity, a structured sparsity that is friendly to hardware. By zeroing selected blocks of the convolution and fully connected layer parameters of pre-trained DNN models, we can efficiently speed up the DNN's inference. This results in a smaller memory footprint, faster communication, and fewer operations. Our work presents a vertical system that trains convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into…
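To make the idea of 8x8 weight block sparsity concrete, here is a minimal NumPy sketch. It is not the paper's training procedure: the function names `block_sparsify` and `compact`, the `keep_ratio` parameter, and the choice of ranking blocks by their sum of absolute values are all assumptions made for illustration. The sketch zeroes the lowest-scoring 8x8 tiles of a 2-D weight matrix, then stores only the surviving tiles with their block coordinates, which is one simple form of data compaction.

```python
import numpy as np


def block_sparsify(weight, block=8, keep_ratio=0.5):
    """Zero the lowest-magnitude (block x block) tiles of a 2-D weight matrix.

    The ranking criterion (sum of absolute values per tile) is an assumption
    chosen for illustration, not the method described in the paper.
    """
    rows, cols = weight.shape
    if rows % block or cols % block:
        raise ValueError("weight dimensions must be multiples of the block size")
    br, bc = rows // block, cols // block

    # View the matrix as a (br, bc) grid of block x block tiles.
    tiles = weight.reshape(br, block, bc, block).transpose(0, 2, 1, 3)

    # Score each tile and keep the n_keep highest-scoring ones.
    scores = np.abs(tiles).sum(axis=(2, 3))
    n_keep = int(round(keep_ratio * br * bc))
    order = np.argsort(scores, axis=None)[::-1][:n_keep]
    mask = np.zeros(br * bc, dtype=bool)
    mask[order] = True
    mask = mask.reshape(br, bc)

    # Apply the tile mask at element resolution and restore the 2-D layout.
    pruned = np.where(mask[:, :, None, None], tiles, 0.0)
    pruned = pruned.transpose(0, 2, 1, 3).reshape(rows, cols)
    return pruned, mask


def compact(pruned, mask, block=8):
    """Return block coordinates and the dense data of the kept tiles only."""
    br, bc = mask.shape
    tiles = pruned.reshape(br, block, bc, block).transpose(0, 2, 1, 3)
    coords = np.argwhere(mask)   # (n_keep, 2) block row/column indices
    data = tiles[mask]           # (n_keep, block, block) kept tiles
    return coords, data


# Usage: prune a random 64x32 matrix to half of its 8x8 tiles.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
pruned, mask = block_sparsify(w, block=8, keep_ratio=0.5)
coords, data = compact(pruned, mask, block=8)
```

Because whole tiles are either kept or zeroed, the compacted form needs only one coordinate pair per tile rather than one index per nonzero element, which is part of what makes this kind of structured sparsity convenient for hardware.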

Taxonomy

Topics: Manufacturing Process and Optimization

Methods: Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings