Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
Paolo D'Alberto

TL;DR
This paper analyzes three KV cache quantization schemes, revealing their relative performance across different budgets and distributions, and introduces statistical inference and information metrics to understand their practical differences.
Contribution
It provides a detailed analysis of KV cache quantization schemes, highlighting the conditions under which each scheme performs best and explaining the impact of Jensen's inequality on quantization errors.
Findings
KQV outperforms other schemes at a budget of 4 across all tested measures.
QKQV is consistently worse than KQV in KL divergence across budgets and distributions.
A budget-dependent crossover exists: at certain budgets other than 4, QKQV outperforms KQV on measures other than KL divergence, revealing an open rate-distortion problem.
Abstract
We analyse three KV cache quantization schemes under a fair bit budget: KV (scalar MSE baseline), KQV (WHT + MSE on ; WHT + MSE + QJL on ), and QKQV (WHT + MSE + QJL on both). Starting from the Beta distribution on the hypersphere, we trace how QJL on inflates inner product variance by , which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences. Three empirical findings emerge. (1) At a budget of 4 (the practically dominant budget), KQV wins on every measure (KL divergence, geometric error, and 6D distance) across all distributions and ranks tested. (2) The K-V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3) A budget-dependent crossover exists: QKQV achieves better…
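The claim that softmax amplifies inner product variance nonlinearly can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes zero-mean Gaussian noise as a stand-in for quantization error on the attention logits, and the logit values, noise levels, and function names are all hypothetical. It shows two things: the average of exp(noise) exceeds 1 even though the noise has zero mean (Jensen's inequality for a convex function), and the mean KL divergence between the clean and noisy softmax grows as the noise variance grows.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_under_noise(logits, sigma, trials=2000, seed=0):
    """Average KL between the clean softmax and the softmax of noisy logits.

    Assumption: the perturbation is i.i.d. zero-mean Gaussian with standard
    deviation sigma, a simplification of real quantization error.
    """
    rng = random.Random(seed)
    p = softmax(logits)
    total = 0.0
    for _ in range(trials):
        noisy = [x + rng.gauss(0.0, sigma) for x in logits]
        total += kl_divergence(p, softmax(noisy))
    return total / trials


def mean_exp_inflation(sigma, trials=20000, seed=1):
    """Monte Carlo estimate of E[exp(eps)] for eps ~ N(0, sigma^2).

    For Gaussian noise the exact value is exp(sigma^2 / 2), which is above 1.
    """
    rng = random.Random(seed)
    return sum(math.exp(rng.gauss(0.0, sigma)) for _ in range(trials)) / trials


logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical attention scores
kl_small = mean_kl_under_noise(logits, sigma=0.1)
kl_large = mean_kl_under_noise(logits, sigma=0.4)
inflation = mean_exp_inflation(sigma=0.5)

print("mean KL at sigma=0.1:", kl_small)
print("mean KL at sigma=0.4:", kl_large)
print("E[exp(eps)] at sigma=0.5:", inflation)
```

Under these assumptions, an estimator that is unbiased on the logits but has higher variance can still distort the attention weights more, because the exponential inside softmax turns extra variance into a systematic upward shift.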
