Quantization Dominates Rank Reduction for KV-Cache Compression

Samuel Salfati

arXiv:2604.11501 · cs.LG · April 14, 2026

TL;DR

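This paper shows that, at matched storage budgets, quantization outperforms rank reduction for compressing transformer KV caches. The explanation offered is structural: under softmax attention, bounded quantization noise typically preserves the ordering of token scores, whereas removing a dimension can change which token is attended.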

Contribution

The paper pairs a formal analysis of why quantization beats rank reduction with empirical results across five models (124M-14B parameters, MHA and GQA) and multiple tasks.

Findings

01

Quantization outperforms rank reduction by 4-364 perplexity (PPL) points at matched storage budgets, depending on model and compression level.

02

On LAMBADA, INT4 quantization stays close to FP16 (+0.23 PPL on Mistral 7B, +0.58 on GPT-2), while rank-32 at identical storage collapses to 0.4%.

03

Joint K+V INT4 quantization achieves a 75% reduction in KV-cache size with minimal PPL increase.
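
The storage arithmetic behind "matched budget" and "75% reduction" can be sketched as follows. This is an illustration, not the paper's accounting: the head dimension of 128 is an assumed value, the helper name is hypothetical, and per-group scale or zero-point metadata for the quantized cache is ignored.

```python
def kv_bits_per_vector(dims: int, bits_per_value: int) -> int:
    """Bits needed to store one key (or value) vector for one head."""
    return dims * bits_per_value


# Assumption for illustration only: a head dimension of 128.
HEAD_DIM = 128

fp16_bits = kv_bits_per_vector(HEAD_DIM, 16)   # full-precision baseline
int4_bits = kv_bits_per_vector(HEAD_DIM, 4)    # keep all dims, 4 bits each
rank32_bits = kv_bits_per_vector(32, 16)       # keep 32 dims at FP16

# Fraction of the FP16 cache saved by INT4 (quantization metadata ignored).
int4_reduction = 1 - int4_bits / fp16_bits

print(f"FP16:    {fp16_bits} bits per vector")
print(f"INT4:    {int4_bits} bits per vector")
print(f"rank-32: {rank32_bits} bits per vector")
print(f"INT4 reduction vs FP16: {int4_reduction:.0%}")
```

Under these assumptions INT4 over all 128 dimensions and rank-32 at FP16 occupy the same number of bits, which is the sense in which the two strategies can be compared at identical storage.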

Abstract

We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a…
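
The asymmetry described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's method or formalization: the query and key vectors are made up, rank reduction is approximated by dropping trailing coordinates (a stand-in for projecting onto a lower-rank subspace), and quantization is a simple symmetric per-vector INT4 scheme. Because softmax is monotonic, the most-attended token is the one with the highest raw score, so the sketch compares scores directly.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def truncate(vec, rank):
    """Toy rank reduction: keep only the first `rank` coordinates."""
    return vec[:rank]


def quantize_int4(vec):
    """Symmetric per-vector INT4: round to integers in [-7, 7], then rescale."""
    scale = max(abs(x) for x in vec) / 7 or 1.0  # guard against all-zero input
    return [max(-7, min(7, round(x / scale))) * scale for x in vec]


def most_attended(query, keys):
    """Index of the key with the highest score (argmax survives softmax)."""
    scores = [dot(query, k) for k in keys]
    return scores.index(max(scores))


# Made-up vectors: key 0 wins mainly through its last coordinate.
query = [1.0, 1.0, 1.0, 1.0]
keys = [
    [0.5, 0.5, 0.5, 0.9],  # full score 2.4
    [0.6, 0.5, 0.5, 0.1],  # full score 1.7
]

full_choice = most_attended(query, keys)

# Dropping the last two coordinates removes the evidence for key 0.
rank = 2
low_rank_choice = most_attended(
    truncate(query, rank), [truncate(k, rank) for k in keys]
)

# Quantization perturbs every coordinate slightly but keeps all of them.
quant_choice = most_attended(query, [quantize_int4(k) for k in keys])

print("full precision attends to key", full_choice)
print("rank-2 truncation attends to key", low_rank_choice)
print("INT4 quantization attends to key", quant_choice)
```

In this constructed case truncation flips the attended token while INT4 does not. The per-coordinate quantization error is at most half the scale, so the score error is bounded, whereas the error from a discarded coordinate depends on how much of the score that coordinate carried.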
