Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks
Terry Gou, Puneet Gupta

TL;DR
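This paper develops and tests three vector quantization (VQ) techniques for neural network weight compression: cosine similarity-based assignment to mitigate codebook collapse and enable end-to-end training, top-1 sampling with a straight-through estimator in place of weighted-average reconstruction, and differentiable neural architecture search (NAS) to select layer-wise quantization configurations.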
Contribution
The work presents VQ methods that combine cosine similarity-based assignment with NAS-driven layer-wise configuration. The authors note that the method does not consistently outperform existing approaches across all quantization levels.
Findings
The proposed methods offer insights into VQ-based compression trade-offs.
Cosine similarity-based assignment is used to mitigate codebook collapse and enable end-to-end training.
Differentiable NAS is used to adaptively select layer-wise quantization configurations.
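To make the NAS finding concrete, here is a minimal sketch of one common way differentiable search over per-layer quantization choices is set up: a softmax over learnable logits mixes the outputs of several candidate quantizers, and the highest-scoring candidate is kept after the search. This is an illustration under stated assumptions, not the authors' procedure. The candidate set (symmetric linear quantizers at a few bit-widths) and the names `uniform_quantize`, `mixed_layer_weights`, `select_config`, and `alphas` are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_quantize(weights, bits):
    """Hypothetical candidate: symmetric linear quantization of one layer."""
    levels = 2 ** (bits - 1) - 1          # e.g. bits=4 -> integers in [-7, 7]
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

def mixed_layer_weights(weights, alphas, candidate_bits):
    """Search phase: softmax-weighted mixture of all candidate quantizations.

    Because the mixture is a smooth function of `alphas`, an autodiff
    framework could train the logits jointly with the weights.
    """
    probs = softmax(alphas)
    candidates = [uniform_quantize(weights, b) for b in candidate_bits]
    mixed = [
        sum(p * cand[i] for p, cand in zip(probs, candidates))
        for i in range(len(weights))
    ]
    return mixed, probs

def select_config(alphas, candidate_bits):
    """After search: keep the candidate with the largest logit."""
    best = max(range(len(alphas)), key=lambda k: alphas[k])
    return candidate_bits[best]

# Illustrative values only; real logits would be learned during training.
layer_weights = [1.0, -1.0, 0.2]
candidate_bits = [2, 4, 8]
alphas = [0.1, 2.0, -1.0]

mixed, probs = mixed_layer_weights(layer_weights, alphas, candidate_bits)
chosen = select_config(alphas, candidate_bits)
```

In a mixed vector/linear setting, the candidate list could presumably also include VQ configurations; the summary does not say which candidates the authors search over.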
Abstract
In this work, we developed and tested 3 techniques for vector quantization (VQ) based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model…
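The assignment step described in the abstract can be sketched in a few lines. The example below is a minimal pure-Python illustration, not the authors' implementation: each weight sub-vector is matched to the codeword with the highest cosine similarity (top-1), the forward pass uses that single codeword rather than a weighted average, and the backward pass applies a straight-through estimator by passing the upstream gradient through unchanged. The function names, the toy codebook, and the sub-vector size are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)

def assign_top1(sub_vectors, codebook):
    """For each weight sub-vector, return the index of the most similar codeword."""
    indices = []
    for w in sub_vectors:
        sims = [cosine_similarity(w, c) for c in codebook]
        indices.append(max(range(len(codebook)), key=lambda k: sims[k]))
    return indices

def quantize_forward(sub_vectors, codebook):
    """Forward pass: replace each sub-vector by its single selected codeword."""
    indices = assign_top1(sub_vectors, codebook)
    return [list(codebook[k]) for k in indices], indices

def ste_backward(grad_output):
    """Straight-through estimator: treat the hard selection as the identity,
    so the gradient w.r.t. the latent weights equals the upstream gradient."""
    return [list(g) for g in grad_output]

# Toy data (hypothetical): three 2-D sub-vectors and a two-entry codebook.
weights = [[0.9, 0.1], [-0.2, 1.1], [2.0, 0.3]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

quantized, idx = quantize_forward(weights, codebook)
grad_weights = ste_backward([[0.5, -0.5], [0.0, 1.0], [0.2, 0.2]])
```

Note that cosine similarity compares direction only, so `[2.0, 0.3]` and `[0.9, 0.1]` select the same codeword despite different magnitudes. How the paper handles magnitude is not stated in this summary, so the sketch leaves it out.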
