Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri; Alan Luo; Weicong Chen; Shaochen Zhong; Tianyi Zhang; Qifan Wang; Xia Hu; Xiaotian Han; Vipin Chaudhary

arXiv:2502.15075·cs.LG·May 12, 2026

TL;DR

This paper introduces a theoretically grounded approach to quantizing the Key-Value (KV) cache in LLMs: give keys more bits than values to reduce memory while preserving accuracy.

Contribution

The paper provides two theorems that link the spectral properties of key and value projections to quantization error, and uses them to guide bit allocation under a fixed memory budget.
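The two norms that the theorems refer to can be computed directly. The sketch below is only an illustration: the matrices `w_k` and `w_v` are random stand-ins rather than real model weights, the larger scale on `w_k` is an assumption, and the helper names are hypothetical. It uses only the standard library; with NumPy, `numpy.linalg.norm(W, 2)` and `numpy.linalg.norm(W, 'fro')` return the same two quantities.

```python
import math
import random


def frobenius_norm(w):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in w for x in row))


def spectral_norm(w, iters=200, seed=0):
    """Largest singular value of w, estimated by power iteration on w^T w."""
    rng = random.Random(seed)
    n_cols = len(w[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        # u = w v, then v = w^T u, then renormalize v.
        u = [sum(r[j] * v[j] for j in range(n_cols)) for r in w]
        v = [sum(w[i][j] * u[i] for i in range(len(w))) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    u = [sum(r[j] * v[j] for j in range(n_cols)) for r in w]
    return math.sqrt(sum(x * x for x in u))


def random_matrix(rows, cols, scale, seed):
    """Gaussian matrix with the given standard deviation (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical stand-ins for one layer's key and value projection weights.
# The larger scale on w_k is an assumption made only for illustration.
w_k = random_matrix(16, 16, scale=2.0, seed=1)
w_v = random_matrix(16, 16, scale=1.0, seed=2)

for name, w in (("W_K", w_k), ("W_V", w_v)):
    print(name, "spectral:", round(spectral_norm(w), 3),
          "Frobenius:", round(frobenius_norm(w), 3))
```

To repeat the comparison on a real model, one would replace the random matrices with the key and value projection weights of each attention layer.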

Findings

01 Key projections have larger spectral and Frobenius norms than value matrices.

02 For a given memory budget, prioritizing key precision reduces quantization error and better preserves accuracy.

03 Key-favored quantization retains up to 98.3% accuracy with less memory.
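The memory side of the trade-off is simple arithmetic. The sketch below counts only the quantized payload and ignores scales, zero-points, and other metadata. The model shape and the 4-bit/2-bit split are assumptions chosen for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   key_bits, value_bits, batch_size=1):
    """Payload bytes of a quantized KV cache, ignoring scales and zero-points."""
    # Keys and values each store one element per (layer, head, position, channel).
    elements_per_tensor = batch_size * num_layers * num_kv_heads * head_dim * seq_len
    total_bits = elements_per_tensor * (key_bits + value_bits)
    return total_bits // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(key_bits=16, value_bits=16, **shape)
k4v2 = kv_cache_bytes(key_bits=4, value_bits=2, **shape)
k2v4 = kv_cache_bytes(key_bits=2, value_bits=4, **shape)

print(f"FP16: {fp16 / 2**20:.0f} MiB")
print(f"K4V2: {k4v2 / 2**20:.0f} MiB")
print(f"K2V4: {k2v4 / 2**20:.0f} MiB")
```

Swapping which tensor gets the extra bits leaves the footprint unchanged, so a choice between the two allocations has to be made on accuracy grounds.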

Abstract

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit…
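As a rough illustration of what mixed-precision KV quantization involves, the sketch below quantizes each cached key and value vector with a generic round-to-nearest min-max scheme and measures the change in a single-query attention output. This is not the paper's quantizer: the synthetic data, the larger key scale, and all function names are assumptions made for the example.

```python
import math
import random


def quantize(vec, bits):
    """Uniform min-max quantization of one vector, returned dequantized."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(vec)
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]


def attention(q, keys, values):
    """Single-query softmax attention over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def l2_error(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def output_error(q, keys, values, key_bits, value_bits):
    """Attention-output error after quantizing each cached token vector."""
    k_hat = [quantize(k, key_bits) for k in keys]
    v_hat = [quantize(v, value_bits) for v in values]
    return l2_error(attention(q, keys, values), attention(q, k_hat, v_hat))


rng = random.Random(0)
dim, seq_len = 16, 64
# Synthetic cache. The larger key scale is an assumption for illustration only.
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
keys = [[rng.gauss(0.0, 2.0) for _ in range(dim)] for _ in range(seq_len)]
values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]

for k_bits, v_bits in ((4, 2), (2, 4)):
    err = output_error(q, keys, values, k_bits, v_bits)
    print(f"K{k_bits}V{v_bits}: attention output L2 error = {err:.4f}")
```

The printed errors depend entirely on the synthetic data and say nothing about real models. The sketch only shows where the two bit widths enter the computation.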

Code & Models

Repositories

mohsenhariri/spectral-kv (GitHub)
