Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

TL;DR
Ada-KV introduces a head-wise adaptive cache eviction strategy for large language models, significantly improving inference efficiency by tailoring cache management to attention head patterns, outperforming existing uniform approaches.
Contribution
This paper presents Ada-KV, the first adaptive, head-wise cache eviction method guided by a theoretical loss bound, enhancing LLM inference efficiency and quality.
Findings
Significant quality improvements over existing methods.
Effective integration with prior cache eviction techniques.
Validated on multiple datasets with diverse scenarios.
Abstract
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting large numbers of non-critical cache elements at runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, which explains the optimization target of prior cache eviction methods and guides the optimization of adaptive budget allocation. Based on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive…
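The contrast between uniform and head-wise adaptive budgets can be illustrated with a small sketch. The abstract does not spell out the allocation rule, so the code below assumes one plausible rule: pool the attention weights of all heads, keep the globally largest entries under the same total budget, and let each head's budget be the number of its entries that survive. The function names (`uniform_budgets`, `adaptive_budgets`, `retained_weight`) and the toy attention weights are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def uniform_budgets(num_heads, budget_per_head):
    # Baseline: every head keeps the same number of cache entries.
    return np.full(num_heads, budget_per_head, dtype=int)

def adaptive_budgets(scores, budget_per_head):
    # ASSUMPTION: budgets come from a global top-k over all heads' weights.
    # scores has shape (num_heads, seq_len); each row is one head's
    # non-negative attention weights over the cached tokens.
    if budget_per_head < 1:
        raise ValueError("budget_per_head must be at least 1")
    num_heads, seq_len = scores.shape
    total = min(num_heads * budget_per_head, scores.size)
    flat = scores.ravel()
    # Indices of the `total` largest weights across all heads.
    top = np.argpartition(flat, -total)[-total:]
    # Row-major flattening: integer division recovers the head index.
    head_ids = top // seq_len
    return np.bincount(head_ids, minlength=num_heads)

def retained_weight(scores, budgets):
    # Total attention weight kept when head h retains its top budgets[h] entries.
    sorted_desc = -np.sort(-scores, axis=1)
    return float(sum(sorted_desc[h, :b].sum() for h, b in enumerate(budgets)))

# Toy input: head 0 is concentrated on one token, head 1 is dispersed.
scores = np.array([
    [0.90, 0.05, 0.02, 0.01, 0.01, 0.01],
    [0.20, 0.20, 0.15, 0.15, 0.15, 0.15],
])
uni = uniform_budgets(num_heads=2, budget_per_head=3)
ada = adaptive_budgets(scores, budget_per_head=3)
print("uniform :", uni.tolist(), retained_weight(scores, uni))
print("adaptive:", ada.tolist(), retained_weight(scores, ada))
```

With this toy input, the concentrated head is given a budget of 1 and the dispersed head a budget of 5, so the total budget is unchanged while more attention weight is retained than under the uniform split. Under this assumed rule, the retained weight can never be lower than the uniform baseline, because a global top-k maximizes the sum of kept weights for a fixed total.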
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Advanced Data Compression Techniques
Methods: Softmax · Attention Is All You Need · Balanced Selection
