MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization
Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, Kejie Huang

TL;DR
MVQ introduces a novel vector quantization method with pruning and masked k-means to improve DNN compression and acceleration, achieving higher accuracy, energy efficiency, and reduced hardware size.
Contribution
The paper proposes MVQ, a new approach combining pruning and masked k-means for better weight approximation and efficient hardware implementation for DNN acceleration.
Findings
Outperforms conventional VQ in accuracy at similar compression ratios.
Reduces FLOPs and enhances energy efficiency in DNN inference.
Achieves a 2.3× energy-efficiency boost and a 55% smaller systolic array in an ASIC implementation.
Abstract
Vector quantization (VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading data width of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because the important weights are not well preserved. To tackle this problem, a novel approach called MVQ is proposed, which aims to better approximate important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then uses the masked k-means algorithm to minimize the vector clustering error between the remaining weights and the codewords. Only the distances between the unpruned weights and the codewords are computed, and these distances are then used to update the codewords. At the architecture level, our accelerator implements vector quantization on an EWS (enhanced weight stationary) CNN…
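The algorithm-level idea in the abstract (N:M pruning followed by masked k-means) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes weight magnitude as the importance criterion, a 2:4 pruning pattern by default, and random initialization of the codebook; the abstract specifies none of these. The function names `nm_prune_mask` and `masked_kmeans` are hypothetical.

```python
import numpy as np


def nm_prune_mask(vectors, n=2, m=4):
    """Return a 0/1 mask that keeps the n largest-magnitude weights
    in every group of m consecutive weights (assumed criterion)."""
    rows, dim = vectors.shape
    assert dim % m == 0, "vector length must be a multiple of m"
    groups = np.abs(vectors).reshape(rows, dim // m, m)
    # Indices of the n largest magnitudes inside each group.
    keep = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=-1)
    return mask.reshape(rows, dim)


def masked_assign(vectors, mask, codebook):
    """Assign each vector to the codeword with the smallest masked
    squared distance; pruned positions contribute nothing."""
    diff = (vectors[:, None, :] - codebook[None, :, :]) * mask[:, None, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)


def masked_kmeans(vectors, mask, k, iters=20, seed=0):
    """Masked k-means sketch: distances and codeword updates use
    only the unpruned weights."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize codewords from randomly chosen vectors.
    init = rng.choice(len(vectors), size=k, replace=False)
    codebook = vectors[init].copy()
    for _ in range(iters):
        assign = masked_assign(vectors, mask, codebook)
        for j in range(k):
            members = assign == j
            if not members.any():
                continue  # leave an empty cluster's codeword unchanged
            num = (vectors[members] * mask[members]).sum(axis=0)
            den = mask[members].sum(axis=0)
            # Masked mean per dimension; keep the old value where no
            # unpruned weight falls in this cluster and dimension.
            codebook[j] = np.where(den > 0, num / np.maximum(den, 1.0), codebook[j])
    return codebook, masked_assign(vectors, mask, codebook)
```

Under these assumptions, the compressed weights would be reconstructed as `codebook[assign] * mask`, so each stored vector needs only a codeword index and its pruning mask.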
Taxonomy
Methods: Balanced Selection · Pruning
