Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
Ruijie Miao, Zhiming Wang, Wang Li, Shiwei Wu, Shufan Liu, Yanbing Jiang, Tong Yang

TL;DR
This paper introduces MixedDimKV, a KV cache compression method that allocates dimensions to individual tokens at a finer granularity than token eviction, reducing memory usage while maintaining accuracy in long-context transformer inference.
Contribution
The paper proposes MixedDimKV, a mixed-dimension allocation approach for KV caches, and MixedDimKV-H, a variant that integrates head-level importance information. Both outperform comparable prior methods on long-context benchmarks.
Findings
MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling.
Achieves performance comparable to full attention with only 6.25% of the KV cache.
Maintains 100% accuracy at 50K context length with 0.26% cache usage.
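To put the two budget figures above in concrete terms, the short sketch below converts them into token counts. It assumes that "cache usage" means the fraction of full-dimension KV entries retained, which this summary does not state explicitly; the function name is hypothetical.

```python
# Illustrative arithmetic only. Assumption: "cache usage" is the fraction of
# full-dimension KV entries retained relative to the uncompressed cache.

def full_token_equivalents(context_len: int, cache_fraction: float) -> int:
    """Number of full-dimension token slots that fit in the given budget."""
    return round(context_len * cache_fraction)

# 0.26% of a 50K-token context is about 130 full-dimension token slots.
slots_at_50k = full_token_equivalents(50_000, 0.0026)

# A 6.25% budget corresponds to a 16x reduction in cache size.
compression_ratio = 1 / 0.0625

print(slots_at_50k, compression_ratio)
```

Under this reading, a mixed-dimension scheme could spread those 130 slots' worth of dimensions over more than 130 tokens by storing some of them at reduced dimension.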
Abstract
Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance…
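The framing of token eviction as a zero-or-full dimension assignment can be illustrated with a small sketch. The code below is not the paper's allocation algorithm, which this summary does not describe. It is a minimal toy comparison under stated assumptions: the importance scores are made up, the tier fractions (full, half, quarter) and the even budget split across tiers are hypothetical choices, and both function names are invented for illustration.

```python
from typing import List, Sequence

def eviction_allocation(scores: Sequence[float], head_dim: int, budget: int) -> List[int]:
    """Token eviction: every token gets either 0 or head_dim dimensions."""
    keep = budget // head_dim  # how many full-dimension tokens fit
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dims = [0] * len(scores)
    for i in order[:keep]:
        dims[i] = head_dim
    return dims

def mixed_allocation(
    scores: Sequence[float],
    head_dim: int,
    budget: int,
    levels: Sequence[float] = (1.0, 0.5, 0.25),  # hypothetical tier fractions
) -> List[int]:
    """Hypothetical tiered scheme: split the budget evenly across tiers and
    hand out progressively smaller dimensions to less important tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dims = [0] * len(scores)
    per_tier = budget // len(levels)
    pos = 0
    for frac in levels:
        width = int(head_dim * frac)   # dimensions per token in this tier
        count = per_tier // width      # tokens this tier can afford
        for i in order[pos:pos + count]:
            dims[i] = width
        pos += count
    return dims

# Made-up importance scores for 8 tokens, head dimension 8, budget of 24 dims.
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4]
evicted = eviction_allocation(scores, head_dim=8, budget=24)
mixed = mixed_allocation(scores, head_dim=8, budget=24)

print(evicted, sum(evicted))  # 3 tokens kept at full dimension
print(mixed, sum(mixed))      # 7 tokens kept at mixed dimensions
```

With the same total budget of 24 dimensions, the eviction baseline retains three tokens at full dimension, while the toy mixed scheme retains seven tokens at dimensions 8, 4, or 2. How the reduced-dimension representations are actually computed and how dimensions are assigned in MixedDimKV is left to the paper itself.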
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management
