TL;DR
This paper introduces a theoretically grounded approach to quantizing the Key-Value (KV) cache in LLMs: it gives keys more bits than values, reducing memory while preserving accuracy.
Contribution
The paper provides two theorems that link the spectral properties of the key and value projections to quantization error, guiding bit allocation under a fixed memory budget.
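The two norms involved in this comparison can be computed directly. The sketch below is illustrative only: `matrix_norms`, `w_k`, and `w_v` are hypothetical names, and the random matrices are stand-ins deliberately scaled so the "key" matrix is larger. They are not weights from any real model and do not reproduce the paper's measurements.

```python
import numpy as np

def matrix_norms(w):
    """Return (spectral norm, Frobenius norm) of a 2-D matrix."""
    spectral = np.linalg.norm(w, ord=2)       # largest singular value
    frobenius = np.linalg.norm(w, ord="fro")  # sqrt of the sum of squared entries
    return float(spectral), float(frobenius)

rng = np.random.default_rng(0)
# Assumption: synthetic stand-ins for key and value projection weights.
# The key matrix is given a larger scale on purpose, to mimic the claimed gap.
w_k = rng.normal(scale=2.0, size=(64, 64))
w_v = rng.normal(scale=1.0, size=(64, 64))

k_spec, k_fro = matrix_norms(w_k)
v_spec, v_fro = matrix_norms(w_v)
print(f"key:   spectral={k_spec:.2f}  frobenius={k_fro:.2f}")
print(f"value: spectral={v_spec:.2f}  frobenius={v_fro:.2f}")
```

To check the claim on a real model, the same function could be applied to each layer's key and value projection weights in place of the synthetic matrices.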
Findings
Key projections have larger spectral and Frobenius norms than value matrices.
Prioritizing key precision reduces quantization error and preserves accuracy.
Key-favored quantization retains up to 98.3% accuracy with less memory.
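A minimal sketch of how two bit allocations with the same memory footprint could be compared. It assumes per-row min-max uniform quantization and single-head attention on synthetic data. The paper's actual quantizer is not specified here, and every name below (`quantize_dequantize`, `attention`, `output_error`) is hypothetical. The printed numbers describe this toy setup only, not the paper's results.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Uniform min-max quantization per row, followed by dequantization."""
    levels = 2**bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Use a scale of 1.0 for constant rows to avoid dividing by zero.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return q * scale + lo

def attention(q, k, v):
    """Single-head scaled dot-product attention with a stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def output_error(q, k, v, k_bits, v_bits):
    """Relative error of the attention output after quantizing K and V."""
    ref = attention(q, k, v)
    approx = attention(q,
                       quantize_dequantize(k, k_bits),
                       quantize_dequantize(v, v_bits))
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))

rng = np.random.default_rng(0)
# Assumption: synthetic cache contents; keys are given a larger scale than values.
q = rng.normal(size=(8, 32))
k = rng.normal(scale=2.0, size=(128, 32))
v = rng.normal(scale=1.0, size=(128, 32))

# Both allocations store 4 + 2 = 6 bits per key/value element pair.
err_k4v2 = output_error(q, k, v, k_bits=4, v_bits=2)
err_k2v4 = output_error(q, k, v, k_bits=2, v_bits=4)
print(f"4-bit keys, 2-bit values: relative error {err_k4v2:.4f}")
print(f"2-bit keys, 4-bit values: relative error {err_k2v4:.4f}")
```

Which allocation comes out ahead on synthetic data depends on the chosen scales and quantizer, so this is a harness for experimenting with the idea rather than evidence for or against the finding.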
Abstract
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit…
