DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

Xiabin Zhou; Wenbin Wang; Minyan Zeng; Jiaxian Guo; Xuebo Liu; Li Shen; Min Zhang; Liang Ding

arXiv:2412.14838 · cs.CL · May 28, 2025

Open Access 3 Reviews

TL;DR

DynamicKV introduces a task-aware, adaptive KV cache compression method for long-context LLMs, significantly reducing cache size while maintaining high performance across various tasks.

Contribution

It proposes a novel adaptive KV cache management strategy that dynamically adjusts token retention per layer based on task-specific activation patterns.

Findings

01

Retains only 1.7% of the KV cache while achieving ~85% of full-KV-cache performance

02

Outperforms SOTA methods by 11% in Needle-in-a-Haystack test

03

Achieves efficient long-context processing with minimal cache usage
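
To give a rough sense of scale for the 1.7% figure, the sketch below estimates KV cache memory for one sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token prompt are illustrative assumptions, not numbers taken from this page, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence."""
    # The factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16, 32,768 tokens.
    full = kv_cache_bytes(32, 8, 128, 32_768)
    compressed = full * 0.017  # the 1.7% retention ratio quoted above
    print(f"full cache:  {full / 2**20:.1f} MiB")
    print(f"1.7% cache:  {compressed / 2**20:.1f} MiB")
```

Under these assumptions the full cache is 4 GiB per sequence, so a 1.7% budget would correspond to roughly 70 MiB.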

Abstract

Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task's unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85%…
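
The budget mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `dynamic_layer_budgets`, its parameters (`avg_budget`, `max_ratio`, `update_every`), and the assumption that each layer already has one importance score per token are all hypothetical choices made for illustration. Each layer temporarily keeps up to a per-layer maximum of its highest-scoring tokens; every few layers, the retained tokens of all layers processed so far compete for a shared global budget, and each layer is trimmed to the number of tokens it won.

```python
def dynamic_layer_budgets(scores, avg_budget, max_ratio=2.0, update_every=2):
    """Return, per layer, the token indices retained under a shared budget.

    scores: list of per-layer lists, scores[l][t] = importance of token t at layer l.
    avg_budget: average number of tokens retained per layer (sets the global budget).
    max_ratio: per-layer maximum, as a multiple of avg_budget (assumed parameter).
    update_every: how often (in layers) earlier layers are re-trimmed (assumed).
    """
    per_layer_max = int(avg_budget * max_ratio)
    kept = []  # kept[l] holds token indices for layer l, best score first
    for layer, layer_scores in enumerate(scores):
        # Temporarily retain the per-layer maximum for the current layer.
        ranked = sorted(range(len(layer_scores)),
                        key=lambda t: layer_scores[t], reverse=True)
        kept.append(ranked[:per_layer_max])

        done = layer + 1
        if done % update_every == 0 or done == len(scores):
            # Global budget for the layers processed so far.
            global_budget = avg_budget * done
            # Pool every retained (score, layer) pair and rank them globally.
            pooled = [(scores[l][t], l) for l in range(done) for t in kept[l]]
            pooled.sort(key=lambda pair: pair[0], reverse=True)
            # Count how many globally top-ranked tokens each layer contributes.
            counts = [0] * done
            for _, l in pooled[:global_budget]:
                counts[l] += 1
            # Shrink each preceding layer's cache to its new budget.
            for l in range(done):
                kept[l] = kept[l][:counts[l]]
    return kept


if __name__ == "__main__":
    import random
    random.seed(0)
    # Four hypothetical layers, 100 tokens each, with random importance scores.
    demo_scores = [[random.random() for _ in range(100)] for _ in range(4)]
    demo_kept = dynamic_layer_budgets(demo_scores, avg_budget=10)
    print([len(k) for k in demo_kept])  # per-layer retained token counts
```

In this sketch the per-layer counts differ from layer to layer but never exceed `avg_budget * max_ratio`, and their total never exceeds `avg_budget` times the number of layers. How the importance scores are computed is left outside the sketch.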

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

a. The argument for a dynamic cache budget across different layers is sound and consistent with the findings of much recent literature.
b. Comprehensive model and baseline coverage for the LongBench evaluation.

Weaknesses

a. Does not appear to support FlashAttention, judging by Eq. 1.
b. No proper efficiency evaluation.
c. Dataset-wise, LongBench is too short to serve as the only long-context evaluation. Please consider adding coverage of InfiniteBench and RULER, with a more long-context-capable model such as Llama 3.1/3.2.
d. I am interested in comparing DynamicKV with some newer head-based methods, such as Ada-KV and MInference.
e. The main argument of the DynamicKV is different tasks might prefer a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The experiment is solid
- The problem of finding more efficient architectures for transformers is relevant and not saturated

Weaknesses

- The Needle in a Haystack result covers only one model; results on more models would be more convincing.
- For the Needle in a Haystack task, it might be better to test longer contexts with a model that supports a longer context window, such as InternLM-2.5-7B-Chat-1M or Llama-3-8B-Instruct-Gradient-1048k.

Reviewer 03 · Rating 5 · Confidence 5

Strengths

- The paper tackles an important and timely problem.
- Interesting observation that different tasks exhibit different KV cache patterns.
- Promising accuracy results under high compression ratio.

Weaknesses

- The technical novelty is limited. Dynamic layer-wise KV cache allocation has been explored before, for example in https://arxiv.org/pdf/2406.13035 and https://arxiv.org/pdf/2405.14366. It would be better if the authors discussed and compared the proposed method with existing ones. Also, many of the optimizations proposed by this work come from prior studies, e.g., pooling from SnapKV and layer-wise allocation from PyramidKV.
- Significantly increased tuning cost. The method is not easy to use, it r

Taxonomy

Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Algorithms and Data Compression

Methods: Linear Layer · Multi-Head Attention · Residual Connection · Adam · Layer Normalization · Weight Decay · Softmax · WordPiece · Attention Dropout · Attention Is All You Need