VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization
Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin

TL;DR
VQKV is a training-free vector quantization method that significantly compresses Key-Value caches in large language models, enabling longer context processing with minimal performance loss.
Contribution
It introduces a novel vector quantization approach for cache compression that achieves high ratios without training, maintaining model fidelity.
Findings
Achieves 82.8% compression ratio on LLaMA3.1-8B.
Retains 98.6% of baseline performance on LongBench.
Enables 4.3x longer generation length.
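To put the compression figure in concrete terms, the arithmetic below estimates the per-token KV cache size before and after compression. It is a rough sketch, not a calculation from the paper. It assumes the publicly documented LLaMA3.1-8B attention shape (32 layers, 8 KV heads, head dimension 128), a 16-bit cache, and that "compression ratio" means the fraction of cache memory removed. The function names are illustrative.

```python
# Assumed LLaMA3.1-8B attention shape (public model config, not from this page).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # fp16 / bf16 cache entries


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Uncompressed KV cache bytes for one token; the factor 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def compressed_bytes_per_token(compression_ratio, baseline_bytes):
    """Assumption: the ratio is the fraction of cache memory that is removed."""
    return baseline_bytes * (1.0 - compression_ratio)


baseline = kv_bytes_per_token()
compressed = compressed_bytes_per_token(0.828, baseline)
cache_only_factor = baseline / compressed

print(baseline)  # 131072 bytes, i.e. 128 KiB per token
print(f"compressed: {compressed / 1024:.1f} KiB per token")
print(f"cache-only capacity factor: {cache_only_factor:.2f}x")
```

The cache-only factor computed here counts KV memory alone. The page does not say how the 4.3x generation-length figure was measured, so the two numbers should not be expected to coincide.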
Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
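The idea of replacing a vector of floats with a few integer indices can be illustrated with a generic residual vector quantizer. This is a minimal sketch and not the authors' implementation. The codebooks are fitted with plain k-means as a stand-in for whatever fitting procedure the paper uses, the data is random rather than real KV activations, and all function names and sizes (4 codebooks of 64 entries) are hypothetical.

```python
import numpy as np


def nearest_code(x, codebook):
    """Index of the nearest codeword for each row of x. x: (n, d), codebook: (k, d)."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def fit_codebook(x, k, iters=10, seed=0):
    """Plain k-means; a stand-in for the paper's codebook fitting (assumption)."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = nearest_code(x, codebook)
        for j in range(k):
            members = x[idx == j]
            if len(members):  # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    return codebook


def fit_residual_vq(x, num_codebooks, k):
    """Fit one codebook per stage, each on the residual left by earlier stages."""
    codebooks, residual = [], x.copy()
    for stage in range(num_codebooks):
        cb = fit_codebook(residual, k, seed=stage)
        residual = residual - cb[nearest_code(residual, cb)]
        codebooks.append(cb)
    return codebooks


def encode(x, codebooks):
    """Map each vector to one integer index per codebook."""
    residual, indices = x.copy(), []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        indices.append(idx)
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)  # shape (n, num_codebooks)


def decode(indices, codebooks):
    """Reconstruct vectors by summing the selected codeword from each stage."""
    return sum(cb[indices[:, s]] for s, cb in enumerate(codebooks))


rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 128)).astype(np.float32)  # toy stand-in for key vectors
codebooks = fit_residual_vq(keys, num_codebooks=4, k=64)
codes = encode(keys, codebooks)
recon = decode(codes, codebooks)

# Index storage per vector: 4 indices x 6 bits (64 entries) versus 128 x 16-bit floats.
bits_compressed = codes.shape[1] * int(np.log2(64))
bits_original = keys.shape[1] * 16
print(codes.shape, bits_compressed, bits_original)
print("relative MSE:", float(((keys - recon) ** 2).mean() / (keys ** 2).mean()))
```

The bit count above ignores the memory taken by the codebooks themselves, which are shared across all tokens. Random Gaussian data is close to a worst case for this kind of quantizer, so the reconstruction error printed here says nothing about the fidelity reported for real KV caches.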
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Applying vector quantization at the vector level (vs. per-scalar or token eviction) is a reasonable design to preserve intra-vector structure. 2. Insightful observation on periodic reconstruction errors in key dimensions, leading to a two-branch VQ for low/high “frequency” subspaces. 3. Ablations on codebook number/size clarify the capacity–compression trade-off and provide tuning guidance.
1. The method trains RSimVQ codebooks on ~10M tokens; this is not training-free, which can affect the paper's positioning. 2. Reconstruction adds matmul/lookups; decoding quantization is batched every $L_{\text{local}}$ tokens but still incurs periodic overhead. There is no wall-clock comparison vs. widely used baselines under FlashAttention-2/vLLM pipelines. 3. Baseline methods are fixed to particular ratios/knobs (e.g., 4-bit KIVI, SnapKV middle-token recall) that may not match VQKV’s effective memory
1. Their experiments are well presented and compared to lots of baselines, showing the effectiveness of vector quantization for KV cache. 2. Apart from quantization quality, the authors also discuss the efficiency of their method, which is an appropriate and important consideration. Since the primary motivation for quantizing the KV cache is to reduce memory usage and latency, the method is only meaningful if the computational overhead introduced by vector quantization does not outweigh the lat
I believe this work still has a lot of room for improvement and is currently not good enough, based on these reasons: 1. The claimed main contribution, vector quantization, is already broadly explored by prior works [1,2,3] and is not novel. The authors have made some improvements, for example, to apply two codebooks for low- and high-frequency components of the key cache, considering the impact of RoPE, though this improvement is not critical and is not well justified. For example, is it bette
* Applies vector quantization to increase the compression ratio of the KV cache, targeting substantial memory footprint reduction. * Keys are more sensitive to quantization error than Values, so the method introduces two codebooks for the Key cache to mitigate this issue.
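The two-codebook idea for keys mentioned in the reviews can be sketched as quantizing two disjoint subsets of the key dimensions separately, each with its own codebook. This is only an illustration of the general structure. The split at the midpoint of the head dimension, the codebooks sampled directly from the data, and all names below are assumptions; the page does not specify how the paper assigns dimensions to the low- and high-frequency branches.

```python
import numpy as np


def sample_codebook(x, k, seed):
    """Hypothetical codebook: k rows sampled from the data (no fitting)."""
    rng = np.random.default_rng(seed)
    return x[rng.choice(len(x), size=k, replace=False)].copy()


def quantize(x, codebook):
    """Index of the nearest codeword for each row of x."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def split_key_dims(head_dim, boundary):
    """Assumed split into two index sets; the real assignment is not given here."""
    dims = np.arange(head_dim)
    return dims[:boundary], dims[boundary:]


rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 128)).astype(np.float32)  # toy key vectors

dims_a, dims_b = split_key_dims(head_dim=128, boundary=64)
cb_a = sample_codebook(keys[:, dims_a], k=32, seed=1)  # codebook for branch A
cb_b = sample_codebook(keys[:, dims_b], k=32, seed=2)  # codebook for branch B

idx_a = quantize(keys[:, dims_a], cb_a)
idx_b = quantize(keys[:, dims_b], cb_b)

# Reconstruction stitches the two branches back into full key vectors.
recon = np.empty_like(keys)
recon[:, dims_a] = cb_a[idx_a]
recon[:, dims_b] = cb_b[idx_b]
print(idx_a.shape, idx_b.shape, recon.shape)
```

Each key is then stored as one index per branch, so the two subspaces can use different codebook sizes if one of them turns out to be harder to reconstruct.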
- The main concern is end-to-end throughput in real systems. While the paper reports memory savings (compression ratios), it provides no latency/throughput evaluation. In memory-bound LLM inference, reduced footprint can correlate with higher throughput, but this is not guaranteed. Practical speedups depend on kernel design, cache behavior, and data movement. - Table 4 lists codebook configurations, but efficient dequantization typically requires the codebooks to reside in a cache that can be re
1. The evaluation results are good. VQKV is evaluated on three well-established long-context benchmarks and achieves good performance at a high compression ratio.
1. The novelty is somewhat limited. Vector quantization for LLMs has already been explored in prior literature, for example on weights (QuiP# [1], AQLM [2]) or the KV cache (VQLLM [3], CommVQ [4]). 2. Lack of efficiency evaluation. A major goal of compressing the KV cache is to reduce the fetching time from GPU HBM (which is the bottleneck of decoding latency), while VQKV still has to fetch the entire KV cache from HBM and thus likely provides no efficiency gain. Moreover, the paper does not discuss the effi
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Cloud Computing and Resource Management
