HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

Jorge L. Ruiz Williams

arXiv:2605.03562·cs.LG·May 21, 2026

TL;DR

HeadQ is a key-side method for KV-cache quantization that measures cache error in model-visible score space rather than in storage space and corrects it with an additive logit term, reducing the perplexity lost to low-bit quantization in large language models.

Contribution

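The paper proposes HeadQ, a key-side correction for KV-cache quantization: it stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout yields an $A^{2}$-weighted token-distortion surrogate.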

Findings

01

HeadQ removes 84-94% of excess perplexity in 2-bit quantization.

02

Score-space error predicts attention KL better than key MSE.

03

HeadQ improves performance in full-KV 2-bit experiments across six models.
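A small illustration of why raw key MSE can mislead, written as a sketch and not taken from the paper. Softmax is unchanged when every logit moves by the same constant, so a key perturbation that shifts all keys by one common vector has nonzero MSE yet leaves attention untouched, while a smaller perturbation on a single key along the query direction changes it. The toy query, keys, and perturbations below are hypothetical values chosen only for this demonstration.

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def logits(q, keys):
    # Scaled dot-product scores of one query against all keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def kl(p, r):
    # KL(p || r) between two attention distributions.
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def key_mse(keys, keys_hat):
    n = sum(len(k) for k in keys)
    return sum((a - b) ** 2
               for k, kh in zip(keys, keys_hat)
               for a, b in zip(k, kh)) / n

def centered_score_error(s, s_hat):
    # Score error modulo constant shifts: subtract the mean error first.
    e = [b - a for a, b in zip(s, s_hat)]
    mean = sum(e) / len(e)
    return sum((v - mean) ** 2 for v in e) / len(e)

# Hypothetical toy data.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

# Perturbation A: the same offset on every key, so all logits shift equally.
keys_a = [[k[0] + 0.5, k[1] + 0.5] for k in keys]
# Perturbation B: smaller in MSE, but on one key along the query direction.
keys_b = [list(k) for k in keys]
keys_b[1][0] += 0.6

s = logits(q, keys)
p = softmax(s)
mse_a, mse_b = key_mse(keys, keys_a), key_mse(keys, keys_b)
cse_a = centered_score_error(s, logits(q, keys_a))
cse_b = centered_score_error(s, logits(q, keys_b))
kl_a = kl(p, softmax(logits(q, keys_a)))
kl_b = kl(p, softmax(logits(q, keys_b)))

print(f"A: key MSE={mse_a:.3f} centered score err={cse_a:.2e} KL={kl_a:.2e}")
print(f"B: key MSE={mse_b:.3f} centered score err={cse_b:.2e} KL={kl_b:.2e}")
```

In this toy case perturbation A has the larger key MSE but essentially zero attention KL, and the centered score error ranks the two perturbations in the same order as the KL.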

Abstract

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an $A^{2}$-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In…
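The key-side idea in the abstract can be sketched in NumPy under several loud assumptions; this is not the author's implementation. The basis `B` is taken from an SVD of synthetic calibration queries as a stand-in for the calibration-learned query basis, `quantize_2bit` is a hypothetical per-row uniform quantizer, and the side code is kept in full precision. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_keys = 32, 4, 64

# Assumption: queries concentrate near an r-dimensional subspace.
W = rng.standard_normal((r, d))

def sample_queries(n):
    return rng.standard_normal((n, r)) @ W + 0.01 * rng.standard_normal((n, d))

def quantize_2bit(x):
    # Hypothetical per-row uniform quantizer with 4 levels.
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    step = (hi - lo) / 3.0
    return lo + np.round((x - lo) / step) * step

def centered_mse(err):
    # Score error modulo constant shifts across the token axis.
    err = err - err.mean(axis=1, keepdims=True)
    return float(np.mean(err ** 2))

# Calibration: top-r right singular vectors of the queries (an assumed choice).
_, _, Vt = np.linalg.svd(sample_queries(512), full_matrices=False)
B = Vt[:r].T                      # (d, r), orthonormal columns

# Cache write: quantized keys plus a rank-r residual side code per token.
K = rng.standard_normal((n_keys, d))
K_hat = quantize_2bit(K)
side = (K - K_hat) @ B            # (n_keys, r)

# Cache read: quantized logits plus an additive correction in the query basis.
Q = sample_queries(128)
scale = 1.0 / np.sqrt(d)
s_true = scale * Q @ K.T
s_quant = scale * Q @ K_hat.T
corr = scale * (Q @ B) @ side.T

err_raw = centered_mse(s_quant - s_true)
err_corr = centered_mse(s_quant + corr - s_true)
err_wrong = centered_mse(s_quant - corr - s_true)   # sign flipped on purpose

print(f"raw={err_raw:.4f} corrected={err_corr:.6f} wrong-sign={err_wrong:.4f}")
```

Because the correction equals the part of the score error that lies in the span of `B`, it removes nearly all of the error when queries stay close to that subspace, and flipping its sign makes the error larger than doing nothing. How the paper learns the basis and encodes the side code may differ from this sketch.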
