Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators
Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia

TL;DR
This paper introduces weight block sparsity for DNNs, enabling efficient hardware-friendly sparsity that accelerates inference, reduces memory, and maintains accuracy, demonstrated on multiple models and hardware configurations.
Contribution
It presents a system for training and exploiting 8x8 weight block sparsity on GPUs, with compiler support for data compression and computation, enhancing inference speed and efficiency.
Findings
Halved model weights with minimal accuracy loss
Achieved 2x faster inference on ResNet50
Demonstrated hardware-software synergy for FPGA deployment
Abstract
Increasingly large Deep Neural Networks (DNNs) are being developed, trained, and deployed. These networks require significant computational resources, which strains both high-end and resource-limited devices. Our solution is to implement weight block sparsity, a structured sparsity that is friendly to hardware. By zeroing certain blocks of the convolution and fully connected layer parameters of pre-trained DNN models, we can efficiently speed up the DNN's inference process. This results in a smaller memory footprint, faster communication, and fewer operations. Our work presents a vertical system that allows for the training of convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into…
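The idea of zeroing 8x8 blocks and then compacting the surviving weights can be illustrated with a small NumPy sketch. This is not the authors' training procedure, which the excerpt above does not describe. It assumes a simple magnitude rule (drop the tiles with the smallest L1 norm), and the function names `block_sparsify` and `compact` are hypothetical.

```python
import numpy as np

def block_sparsify(weight, block=8, sparsity=0.5):
    """Zero the lowest-magnitude (block x block) tiles of a 2-D weight matrix.

    Assumption: tiles are ranked by L1 norm; the paper may use another criterion.
    """
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "shape must be a multiple of block"
    # View the matrix as a grid of tiles: (row_tiles, block, col_tiles, block).
    tiles = weight.reshape(rows // block, block, cols // block, block)
    # One score per tile: sum of absolute values inside the tile.
    scores = np.abs(tiles).sum(axis=(1, 3))
    n_zero = int(round(sparsity * scores.size))
    # Flat indices of the weakest tiles, which will be zeroed.
    order = np.argsort(scores, axis=None, kind="stable")
    keep = np.ones(scores.size, dtype=bool)
    keep[order[:n_zero]] = False
    mask = keep.reshape(scores.shape)
    # Broadcast the per-tile mask over the elements of each tile.
    pruned = tiles * mask[:, None, :, None]
    return pruned.reshape(rows, cols), mask

def compact(weight, mask, block=8):
    """Return only the kept tiles plus their (tile_row, tile_col) coordinates."""
    rows, cols = weight.shape
    tiles = weight.reshape(rows // block, block, cols // block, block)
    tiles = tiles.transpose(0, 2, 1, 3)  # -> (row_tiles, col_tiles, block, block)
    return tiles[mask], np.argwhere(mask)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64))
pruned, mask = block_sparsify(w, block=8, sparsity=0.5)
packed, coords = compact(pruned, mask, block=8)
```

With 50% sparsity, `packed` holds half as many values as `w`, and `coords` records where each kept tile belongs. A compiler or runtime could use such an index to skip the zeroed tiles entirely, which is the source of the memory and operation savings the abstract describes.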
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
