Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

Terry Gou; Puneet Gupta

arXiv:2604.23172 · cs.LG · April 28, 2026

TL;DR

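This paper develops and tests three vector quantization (VQ) techniques for neural network weight compression: cosine similarity-based assignment to mitigate codebook collapse and enable end-to-end training, a DKM-inspired variant that uses top-1 sampling with a straight-through estimator in place of weighted-average reconstruction, and differentiable neural architecture search (NAS) to select layer-wise quantization configurations.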

Contribution

The work presents VQ methods that pair cosine similarity-based assignment with NAS-driven layer-wise configuration for model weight compression.

Findings

01

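The method does not consistently outperform existing approaches across all quantization levels, but it offers insights into the design trade-offs and behaviors of VQ-based compression.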

02

Cosine similarity-based assignment is adopted to mitigate codebook collapse and enable end-to-end training.

03

Differentiable NAS is investigated as a way to adaptively select layer-wise quantization configurations.

Abstract

In this work, we developed and tested three techniques for vector quantization (VQ)-based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model…
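To make the assignment step in the abstract concrete, here is a minimal forward-pass sketch of cosine-similarity codeword assignment with top-1 selection. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine_similarity`, `assign_top1`, `reconstruct`), the toy codebook, and the reading of "top-1" as a plain argmax are all hypothetical choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_top1(subvectors, codebook):
    """For each weight sub-vector, return the index of the most similar codeword.

    Assumption: "top-1" is taken to mean argmax over cosine similarity.
    """
    indices = []
    for w in subvectors:
        sims = [cosine_similarity(w, c) for c in codebook]
        indices.append(max(range(len(codebook)), key=lambda k: sims[k]))
    return indices

def reconstruct(indices, codebook):
    """Replace each sub-vector by its single assigned codeword (no weighted average)."""
    return [list(codebook[k]) for k in indices]

# Toy data (hypothetical): three 2-D codewords and three weight sub-vectors.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
subvectors = [[0.9, 0.1], [0.2, 2.0], [-3.0, -2.5]]

indices = assign_top1(subvectors, codebook)
quantized = reconstruct(indices, codebook)
```

The sketch covers only the forward pass. In an autodiff framework, a straight-through estimator is commonly written as `w + stop_gradient(q - w)`, so the forward value is the codeword `q` while gradients flow to `w` unchanged. Note also that cosine similarity ignores vector magnitude; how the paper treats scale is not stated in the abstract, so the sketch leaves it out.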
