TL;DR
This paper combines weight permutation, vector quantization, and fine-tuning to compress neural networks, particularly modern architectures, while keeping accuracy on vision tasks close to that of the uncompressed model.
Contribution
It proposes permuting network weights so that they become easier to compress with vector quantization, and connects the search for such permutations to rate-distortion theory.
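To make the idea of permuting weights concrete, here is a minimal sketch of one well-known case: reordering the hidden units between two adjacent layers leaves the network output unchanged, but changes which weights end up next to each other and therefore which ones would share a codeword. The two-layer network, the names W1, W2, forward, and perm, and the subvector length of 4 are all illustrative assumptions, not the paper's implementation or its permutation search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network: x -> relu(W1 @ x) -> W2 @ h
W1 = rng.normal(size=(8, 4))   # (hidden, input)
W2 = rng.normal(size=(3, 8))   # (output, hidden)

def forward(W1, W2, x):
    """Toy forward pass with a ReLU between the two layers."""
    return W2 @ np.maximum(W1 @ x, 0.0)

# A fixed, arbitrary reordering of the 8 hidden units (assumption for illustration).
perm = np.array([3, 0, 7, 1, 6, 2, 5, 4])

W1_p = W1[perm, :]   # reorder the rows of the first layer
W2_p = W2[:, perm]   # reorder the matching columns of the second layer

x = rng.normal(size=4)
same_output = np.allclose(forward(W1, W2, x), forward(W1_p, W2_p, x))

# Splitting each row of W2 into contiguous length-4 subvectors now groups
# different weights together, so a vector quantizer sees different inputs.
groups_before = W2.reshape(-1, 4)
groups_after = W2_p.reshape(-1, 4)
groups_changed = not np.array_equal(groups_before, groups_after)
```

Both same_output and groups_changed are True here: the function is preserved while the grouping of weights differs, which is the degree of freedom a permutation search can exploit.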
Findings
Reduces the gap to the uncompressed model by 40-70%
Improves compression for pointwise convolutions and linear layers
Enhances accuracy with annealed quantization
Abstract
Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms. In this context, vector quantization is an appealing framework that expresses multiple parameters using a single code, and has recently achieved state-of-the-art network compression on a range of core vision and natural language processing tasks. Key to the success of vector quantization is deciding which parameter groups should be compressed together. Previous work has relied on heuristics that group the spatial dimension of individual convolutional filters, but a general solution remains unaddressed. This is desirable for pointwise convolutions (which dominate modern architectures), linear layers (which have no notion of spatial dimension), and convolutions (when more than one filter is compressed to the same codeword). In this paper we make the observation that…
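As a rough illustration of what "expresses multiple parameters using a single code" means, the sketch below splits a weight matrix into short subvectors and replaces each one with the index of its nearest codeword, learned with plain k-means. The function name vector_quantize, the subvector length d, the codebook size k, and the use of unmodified k-means are assumptions chosen for illustration; this is not the paper's algorithm.

```python
import numpy as np

def vector_quantize(W, d=4, k=16, iters=20, seed=0):
    """Split W into length-d subvectors and fit k codewords with plain k-means.

    Assumes W.size is divisible by d and that there are at least k subvectors.
    """
    rng = np.random.default_rng(seed)
    sub = W.reshape(-1, d)  # (n_subvectors, d)
    # Initialise the codebook from k distinct subvectors.
    codebook = sub[rng.choice(len(sub), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every subvector to every codeword.
        dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        codes = dist.argmin(axis=1)
        # Move each codeword to the mean of the subvectors assigned to it.
        for j in range(k):
            members = sub[codes == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    # Final assignment against the updated codebook.
    dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = dist.argmin(axis=1)
    return codebook, codes

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)).astype(np.float32)  # stand-in for a linear layer

codebook, codes = vector_quantize(W)
W_hat = codebook[codes].reshape(W.shape)   # decompressed weights
mse = float(((W - W_hat) ** 2).mean())     # reconstruction error
```

With these settings each group of 4 weights is stored as one index into a 16-entry codebook (4 bits) plus the shared codebook itself. How the matrix is reshaped into subvectors decides which parameters are compressed together, which is the grouping question the abstract raises.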
