Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng; Junlin Lv; Yukun Cao; Xike Xie; S. Kevin Zhou

arXiv:2407.11550 · cs.CL · October 17, 2025

TL;DR

Ada-KV introduces a head-wise adaptive cache eviction strategy for large language models, significantly improving inference efficiency by tailoring cache management to attention head patterns, outperforming existing uniform approaches.

Contribution

This paper presents Ada-KV, the first adaptive, head-wise cache eviction method guided by a theoretical loss bound, enhancing LLM inference efficiency and quality.

Findings

01. Significant quality improvements over existing methods.

02. Effective integration with prior cache eviction techniques.

03. Validated on multiple datasets with diverse scenarios.

Abstract

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting large numbers of non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, which explains the optimization target of prior cache eviction methods and guides the optimization of adaptive budget allocation. Based on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive…
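
The contrast the abstract draws between uniform and head-wise adaptive budgets can be illustrated with a toy sketch. The abstract does not spell out the allocation rule, so the version below is an assumption made for illustration: scores from all heads are pooled, the globally highest-scoring entries are kept, and each head's budget is the number of its entries that survive. The function names (`uniform_budgets`, `adaptive_budgets`, `retained_weight`) and the example scores are hypothetical and are not taken from the paper or its repositories.

```python
from typing import List


def uniform_budgets(num_heads: int, total_budget: int) -> List[int]:
    """Baseline: every head keeps the same number of cache entries."""
    return [total_budget // num_heads] * num_heads


def adaptive_budgets(scores: List[List[float]], total_budget: int) -> List[int]:
    """Assumed rule: pool the scores of all heads, keep the global
    top `total_budget` entries, and count how many each head retained."""
    pooled = [(s, h) for h, row in enumerate(scores) for s in row]
    pooled.sort(key=lambda pair: pair[0], reverse=True)
    budgets = [0] * len(scores)
    for _, head in pooled[:total_budget]:
        budgets[head] += 1
    return budgets


def retained_weight(scores: List[List[float]], budgets: List[int]) -> float:
    """Total attention weight kept when each head keeps its top-`budget` entries."""
    return sum(
        sum(sorted(row, reverse=True)[:b]) for row, b in zip(scores, budgets)
    )


# Toy per-head attention weights over four cached tokens (made-up numbers).
scores = [
    [0.90, 0.05, 0.03, 0.02],  # concentrated head: one entry carries most weight
    [0.30, 0.25, 0.25, 0.20],  # dispersed head: weight is spread out
]
total_budget = 4

uniform = uniform_budgets(len(scores), total_budget)
adaptive = adaptive_budgets(scores, total_budget)

print("uniform budgets: ", uniform)
print("adaptive budgets:", adaptive)
print("retained (uniform): ", round(retained_weight(scores, uniform), 2))
print("retained (adaptive):", round(retained_weight(scores, adaptive), 2))
```

In this toy case the adaptive split moves budget from the concentrated head to the dispersed one and keeps more total attention weight under the same overall budget, which is the intuition behind treating heads differently. How closely this matches the paper's actual procedure should be checked against the full text.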

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Advanced Data Compression Techniques

Methods: Softmax · Attention Is All You Need · Balanced Selection