Training Transformers for KV Cache Compressibility
Yoav Gelberg, Yam Eitan, Michael Bronstein, Yarin Gal, Haggai Maron

TL;DR
This paper introduces KV-CAT, a training method that encourages transformers to learn representations whose KV caches compress well, making long-context language modeling more efficient.
Contribution
It formalizes KV compressibility as a property of learned representations and proposes a training procedure to enhance this property in transformers.
Findings
KV-CAT improves downstream compression quality.
It enhances the tradeoff between compression and model performance.
The method benefits retrieval, long-context QA, and perplexity-based tasks.
Abstract
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations…
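To make the linear scaling in the abstract concrete, the following is a minimal back-of-the-envelope sketch of KV cache memory as a function of prefix length. It is not taken from the paper: the function name `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit elements) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` prefix tokens."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# Hypothetical configuration (an assumption, not a model from the paper).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_tokens in (4096, 32768, 131072):
    gib = kv_cache_bytes(n_tokens, **cfg) / 2**30
    print(f"{n_tokens:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed values each token costs 128 KiB of cache, so doubling the prefix doubles the memory. A compression method that keeps a fraction of the entries reduces the footprint by roughly the same fraction, which is why the quality retained at a given compression ratio matters.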
