NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

Donghyun Son; Euntae Choi; Sungjoo Yoo

arXiv:2505.18231·cs.LG·December 16, 2025

TL;DR

NSNQuant is a calibration-free vector quantization method for the KV cache of large language models. It combines a three-step Normalize-Shift-Normalize transformation with a Hadamard transform to enable robust low-bit compression and improve inference efficiency.

Contribution

The paper introduces NSNQuant, a calibration-free VQ technique that combines a three-step Normalize-Shift-Normalize transformation with a Hadamard transform, improving robustness and efficiency in KV cache compression.

Findings

1. Outperforms prior methods in 1-bit and 2-bit settings.

2. Achieves up to 3× throughput gain over full-precision models.

3. Demonstrates strong generalization across different models and datasets.

Abstract

Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free VQ technique designed for low-bit compression of the KV cache. By applying a three-step transformation, namely 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single…
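The Normalize-Shift-Normalize steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say which norm is used, where the Hadamard transform sits relative to the three steps, or how the codebook is built, so the RMS-based scaling, the placement of the transform after the final normalization, and every function name below (`fwht`, `token_scales`, `nsn_transform`, `nsn_inverse`) are assumptions made for illustration. Quantization itself is left out.

```python
import math

def fwht(row):
    """Orthonormal fast Walsh-Hadamard transform; len(row) must be a power of two."""
    out = list(row)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def token_scales(rows, eps=1e-8):
    """Per-token RMS (assumed choice of norm); eps guards against all-zero rows."""
    d = len(rows[0])
    return [math.sqrt(sum(v * v for v in r) / d) + eps for r in rows]

def nsn_transform(rows):
    """rows: list of tokens, each a list of channel values.

    Returns the transformed tokens and the statistics needed to invert them.
    """
    # Step 1 (Normalize): token-wise scaling to unit RMS.
    s1 = token_scales(rows)
    x1 = [[v / s for v in r] for r, s in zip(rows, s1)]
    # Step 2 (Shift): channel-wise centering across tokens.
    n = len(x1)
    mu = [sum(col) / n for col in zip(*x1)]
    x2 = [[v - m for v, m in zip(r, mu)] for r in x1]
    # Step 3 (Normalize): second token-wise scaling.
    s2 = token_scales(x2)
    x3 = [[v / s for v in r] for r, s in zip(x2, s2)]
    # Hadamard rotation (assumed here to come last); it preserves each row's norm.
    y = [fwht(r) for r in x3]
    return y, (s1, mu, s2)

def nsn_inverse(y, stats):
    """Undo nsn_transform; the orthonormal Walsh-Hadamard transform is its own inverse."""
    s1, mu, s2 = stats
    x3 = [fwht(r) for r in y]
    return [[(v * b + m) * a for v, m in zip(r, mu)]
            for r, a, b in zip(x3, s1, s2)]
```

In this sketch every transformed token has RMS close to 1, which is the property that would let one fixed codebook serve all tokens; in a real pipeline the quantizer would be applied to `y`, and the stored scales and channel means would be used to reconstruct the keys or values.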

Videos

NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache (SlidesLive)

Taxonomy

Topics: Advanced Data Compression Techniques · Error Correcting Code Techniques · Algorithms and Data Compression