NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache
Donghyun Son, Euntae Choi, Sungjoo Yoo

TL;DR
NSNQuant is a calibration-free vector quantization method for the KV cache of large language models. It combines a three-step Normalize-Shift-Normalize transformation with a Hadamard transform to enable robust 1-bit and 2-bit compression, with up to 3× higher throughput than full-precision inference.
Contribution
Introduces NSNQuant, a calibration-free VQ technique that pairs a three-step normalization (Normalize, Shift, Normalize) with a Hadamard transform, making KV cache compression robust to distribution shift without relying on a calibration dataset.
Findings
Outperforms prior methods in 1-bit and 2-bit settings.
Achieves up to 3× throughput gain over full-precision models.
Demonstrates strong generalization across different models and datasets.
Abstract
Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free VQ technique designed for low-bit compression of the KV cache. By applying a three-step transformation, namely 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single…
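The three-step transformation described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `hadamard` and `nsn`, the rescaling of each token to norm sqrt(d) (so that entries have roughly unit variance), the `eps` guard, and the synthetic `keys` tensor are all assumptions made for this example. The quantization step itself is not shown.

```python
import numpy as np

def hadamard(d):
    """Orthonormal d x d Hadamard matrix via the Sylvester construction (d must be a power of two)."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def nsn(x, eps=1e-8):
    """Normalize-Shift-Normalize on x of shape (tokens, channels).

    Returns the transformed array and the statistics (n1, mu, n2)
    that would be needed to undo the transformation.
    """
    d = x.shape[1]
    # 1) Normalize: rescale every token (row) to norm sqrt(d) (assumed target scale).
    n1 = np.linalg.norm(x, axis=1, keepdims=True) + eps
    x = x / n1 * np.sqrt(d)
    # 2) Shift: subtract the per-channel mean taken over tokens.
    mu = x.mean(axis=0, keepdims=True)
    x = x - mu
    # 3) Normalize: rescale every token to norm sqrt(d) again.
    n2 = np.linalg.norm(x, axis=1, keepdims=True) + eps
    x = x / n2 * np.sqrt(d)
    return x, (n1, mu, n2)

# Synthetic stand-in for a key cache: 256 tokens, 64 channels, one outlier channel.
rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64)) * 3.0
keys[:, 5] += 20.0

H = hadamard(64)
y, stats = nsn(keys @ H)  # Hadamard rotation, then Normalize-Shift-Normalize
```

Because the normalized Hadamard matrix is orthogonal, it preserves token norms and commutes with the channel-wise mean subtraction, so in this sketch rotating before or after the three steps gives the same result. Where the rotation sits in the actual method is not stated in the abstract.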
Taxonomy
Topics: Advanced Data Compression Techniques · Error Correcting Code Techniques · Algorithms and Data Compression
