Quantization Dominates Rank Reduction for KV-Cache Compression
Samuel Salfati

TL;DR
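This paper shows that, at matched storage budgets, quantization outperforms rank reduction for compressing transformer KV caches. Under softmax attention, quantization noise is bounded and typically preserves the ordering of token scores, whereas removing a dimension can flip which token is attended.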
Contribution
It formalizes why quantization has a structural advantage over rank reduction, and supports the analysis with empirical results across five models (124M-14B parameters, MHA and GQA), including LAMBADA.
Findings
Quantization outperforms rank reduction by 4-364 perplexity (PPL) points at matched storage budgets, depending on model and compression level.
On LAMBADA, INT4 quantization stays close to FP16 (+0.23 PPL on Mistral 7B, +0.58 on GPT-2), while rank-32 at identical storage collapses to 0.4% accuracy.
Joint K+V INT4 quantization achieves 75% KV reduction with minimal PPL increase.
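The 75% figure and the notion of a "matched storage budget" follow from simple bit counting. The sketch below is illustrative only: the head dimension of 128 is an assumption (it is not stated in the summary), the helper `kv_bits_per_vector` is hypothetical, and the overhead of quantization scales or zero points is ignored.

```python
def kv_bits_per_vector(head_dim, bits_per_value, kept_dims=None):
    """Bits needed to store one key (or value) vector for one attention head.

    kept_dims models rank reduction: only that many dimensions are stored.
    """
    dims = head_dim if kept_dims is None else kept_dims
    return dims * bits_per_value


HEAD_DIM = 128  # assumed head dimension, chosen for illustration

fp16_bits = kv_bits_per_vector(HEAD_DIM, 16)                 # full-precision baseline
int4_bits = kv_bits_per_vector(HEAD_DIM, 4)                  # keep all dims, 4 bits each
rank32_bits = kv_bits_per_vector(HEAD_DIM, 16, kept_dims=32) # keep 32 dims at FP16

# Fraction of KV storage saved by INT4 relative to FP16.
int4_reduction = 1 - int4_bits / fp16_bits

print(f"FP16: {fp16_bits} bits, INT4: {int4_bits} bits, rank-32 FP16: {rank32_bits} bits")
print(f"INT4 reduction vs FP16: {int4_reduction:.0%}")
```

Under these assumptions, INT4 over all 128 dimensions and FP16 over 32 retained dimensions occupy the same number of bits, which is the sense in which the two schemes are compared at identical storage.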
Abstract
We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a…
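The asymmetry described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's method or data: it uses random Gaussian queries and keys, models rank reduction crudely as dropping coordinates (rather than a learned or SVD-based projection), and uses a simple symmetric per-vector INT4 quantizer. All function names and parameters are hypothetical. It measures how often the top-scoring token stays the same after each kind of compression.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


def truncate(vec, rank):
    """Keep the first `rank` coordinates and zero the rest.

    A crude stand-in for rank reduction (assumption, not the paper's scheme).
    """
    return vec[:rank] + [0.0] * (len(vec) - rank)


def quantize_int4(vec):
    """Symmetric per-vector 4-bit quantization followed by dequantization.

    Uses integer levels -7..7, so each element's error is at most scale / 2.
    """
    scale = max(abs(x) for x in vec) / 7.0 or 1.0  # avoid a zero scale
    return [max(-7, min(7, round(x / scale))) * scale for x in vec]


def top1_agreement(trials=100, n_tokens=64, head_dim=128, rank=32, seed=0):
    """Fraction of trials where the highest-scoring key is unchanged."""
    rng = random.Random(seed)
    kept_trunc = kept_int4 = 0
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(head_dim)]
        keys = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
        ref = argmax([dot(q, k) for k in keys])
        kept_trunc += argmax([dot(q, truncate(k, rank)) for k in keys]) == ref
        kept_int4 += argmax([dot(q, quantize_int4(k)) for k in keys]) == ref
    return kept_trunc / trials, kept_int4 / trials


trunc_rate, int4_rate = top1_agreement()
print(f"top-1 token preserved: truncation {trunc_rate:.2f}, INT4 {int4_rate:.2f}")
```

Random Gaussian keys have no low-rank structure, so this setup is unfavorable to truncation and says nothing about the magnitudes reported in the paper. It only shows the mechanism: bounded per-element noise perturbs scores slightly, while discarding dimensions removes part of each score outright.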
