KVSculpt: KV Cache Compression as Distillation

Bo Jiang; Sian Jin

arXiv:2603.27819·cs.LG·March 31, 2026

TL;DR

KVSculpt compresses the KV cache for long-context LLM inference by optimizing a small set of unconstrained KV pairs in continuous space, rather than selecting or merging original entries. It reduces KL divergence by 3.5-4.1x compared to previous methods.

Contribution

Rather than selecting or merging original cache entries, KVSculpt optimizes new KV pairs in continuous space and adds adaptive budget allocation on top.

Findings

01. Reduces KL divergence by 3.5-4.1x compared to previous methods.

02. Adaptive budget allocation improves compression by 1.3x without extra inference cost.

03. Demonstrates non-uniform compression difficulty across layers and heads.

Abstract

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a…
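
The alternating scheme described in the abstract can be illustrated for a single attention head. The following is a minimal sketch under stated assumptions, not the authors' implementation: the objective (mean squared error between full-cache and compressed-cache attention outputs on a set of sample queries `Q`), the initialization from a random subset of original keys, and the names `sculpt`, `solve_values`, and `output_error` are all hypothetical. Keys are updated with a few L-BFGS steps from SciPy (using finite-difference gradients for brevity), then values are re-solved in closed form by least squares.

```python
import numpy as np
from scipy.optimize import minimize


def softmax(x):
    # Row-wise softmax with max-subtraction for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_weights(Q, K):
    # Scaled dot-product attention weights, shape (num_queries, num_keys).
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]))


def solve_values(Q, K_c, target):
    # Closed-form value update: V_c = argmin ||A_c V_c - target||_F^2.
    A_c = attention_weights(Q, K_c)
    V_c, *_ = np.linalg.lstsq(A_c, target, rcond=None)
    return V_c


def output_error(Q, K_c, V_c, target):
    # Assumed objective: MSE between compressed and full attention outputs.
    return float(np.mean((attention_weights(Q, K_c) @ V_c - target) ** 2))


def sculpt(Q, K, V, m, rounds=10, lbfgs_steps=5, seed=0):
    """Fit m free KV pairs that mimic the full cache on sample queries Q."""
    rng = np.random.default_rng(seed)
    target = attention_weights(Q, K) @ V  # full-cache attention output

    # Assumed initialization: a random subset of the original keys.
    idx = rng.choice(K.shape[0], size=m, replace=False)
    K_c = K[idx].copy()
    V_c = solve_values(Q, K_c, target)
    history = [output_error(Q, K_c, V_c, target)]

    for _ in range(rounds):
        # Key update: a few L-BFGS steps with the values held fixed.
        def loss(flat_keys):
            return output_error(Q, flat_keys.reshape(K_c.shape), V_c, target)

        result = minimize(loss, K_c.ravel(), method="L-BFGS-B",
                          options={"maxiter": lbfgs_steps})
        K_c = result.x.reshape(K_c.shape)

        # Value update: exact least-squares solve with the keys held fixed.
        V_c = solve_values(Q, K_c, target)
        history.append(output_error(Q, K_c, V_c, target))

    return K_c, V_c, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q = rng.normal(size=(16, 4))   # sample queries (an assumption)
    K = rng.normal(size=(32, 4))   # original keys
    V = rng.normal(size=(32, 4))   # original values
    K_c, V_c, history = sculpt(Q, K, V, m=8)
    print("error after init:", history[0])
    print("error after alternating:", history[-1])
```

This sketch matches outputs only on the sample queries and ignores how the compressed pairs share attention mass with tokens that arrive later, which a practical method would have to handle. The source does not say which queries or loss the paper uses, so treat both as placeholders.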
