KVSculpt: KV Cache Compression as Distillation
Bo Jiang, Sian Jin

TL;DR
KVSculpt compresses the KV cache for long-context LLM inference by optimizing a small set of unconstrained KV pairs in continuous space, rather than selecting or merging existing entries, and reduces KL divergence by 3.5-4.1x compared to previous methods.
Contribution
Instead of selecting or merging original cache entries, KVSculpt optimizes a smaller set of KV pairs in continuous space and adds adaptive budget allocation on top.
Findings
Reduces KL divergence by 3.5-4.1x compared to previous methods.
Adaptive budget allocation improves compression by 1.3x without extra inference cost.
Demonstrates non-uniform compression difficulty across layers and heads.
Abstract
KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a…
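The alternating scheme described in the abstract (keys via L-BFGS, values via closed-form least squares) can be illustrated with a toy single-head sketch. This is not the authors' implementation. The objective (mean squared error between the original and compressed attention outputs on a set of sample queries), the initialization from a random subset of the original keys, the alternation schedule, and all names (`compress`, `solve_values`, `rounds`, `steps`) are assumptions made for illustration. SciPy's `L-BFGS-B` with finite-difference gradients stands in for whatever L-BFGS setup the paper uses.

```python
import numpy as np
from scipy.optimize import minimize


def softmax(x):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attn_weights(Q, K):
    # Scaled dot-product attention weights, shape (num_queries, num_keys).
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]))


def loss(Q, Kc, Vc, target):
    # Assumed objective: MSE between compressed and original attention output.
    return float(np.mean((attn_weights(Q, Kc) @ Vc - target) ** 2))


def solve_values(Q, Kc, target):
    # With keys fixed, the output is linear in the values, so the best
    # values are an ordinary least-squares solution.
    A = attn_weights(Q, Kc)
    Vc, *_ = np.linalg.lstsq(A, target, rcond=None)
    return Vc


def compress(Q, K, V, m, rounds=5, steps=10, seed=0):
    """Replace n KV pairs by m unconstrained pairs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    target = attn_weights(Q, K) @ V  # original attention output
    # Assumed initialization: a random subset of the original keys.
    Kc = K[rng.choice(K.shape[0], size=m, replace=False)].copy()
    Vc = solve_values(Q, Kc, target)
    history = [loss(Q, Kc, Vc, target)]
    for _ in range(rounds):
        # Key step: a few L-BFGS iterations with values held fixed.
        res = minimize(
            lambda k: loss(Q, k.reshape(Kc.shape), Vc, target),
            Kc.ravel(),
            method="L-BFGS-B",
            options={"maxiter": steps},
        )
        cand = res.x.reshape(Kc.shape)
        # Keep the new keys only if they do not increase the loss.
        if loss(Q, cand, Vc, target) <= history[-1]:
            Kc = cand
        # Value step: closed-form least squares with keys held fixed.
        Vc = solve_values(Q, Kc, target)
        history.append(loss(Q, Kc, Vc, target))
    return Kc, Vc, history
```

Calling `compress(Q, K, V, m=4)` with, say, 16 original pairs returns 4 optimized keys and values plus the loss after each round. The compressed keys are free to drift away from every original key, which is the distinction the abstract draws against eviction and merging.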
