DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding

TL;DR
DynamicKV introduces a task-aware, adaptive KV cache compression method for long-context LLMs, significantly reducing cache size while maintaining high performance across various tasks.
Contribution
It proposes a novel adaptive KV cache management strategy that dynamically adjusts token retention per layer based on task-specific activation patterns.
Findings
Retains only 1.7% of the KV cache size while achieving ~85% performance
Outperforms SOTA methods by 11% in the Needle-in-a-Haystack test
Achieves efficient long-context processing with minimal cache usage
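To give a sense of scale for the 1.7% figure, the short Python calculation below estimates KV cache memory for one sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token context are assumptions chosen for illustration, not the paper's evaluation setup, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values
    for every layer, KV head, and token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed model shape and context length (illustrative, not from the paper).
full = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=8, head_dim=128)
compressed = int(full * 0.017)  # 1.7% retention, as quoted in the findings

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"1.7% of the cache: {compressed / 2**20:.1f} MiB")
```

Under these assumptions the full cache is 4 GiB per sequence, so retaining 1.7% leaves roughly 70 MiB.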
Abstract
Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task's unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85%…
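The budget mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `allocate_layer_budgets`, the parameters `avg_budget`, `max_ratio`, and `update_every`, and the use of one precomputed importance score per prompt token are all assumptions made for illustration. The sketch keeps up to a per-layer maximum for the current layer, then periodically re-selects tokens across all layers seen so far under a global budget, which shrinks the caches of preceding layers.

```python
import numpy as np


def allocate_layer_budgets(scores, avg_budget, max_ratio=2.0, update_every=2):
    """Return the retained token indices for each layer (illustrative sketch).

    scores: list of 1-D arrays, one per layer, holding an assumed
            importance score for every prompt token.
    avg_budget: average number of tokens to retain per layer, so the
            global budget after L layers is avg_budget * L.
    """
    layer_cap = int(avg_budget * max_ratio)  # per-layer maximum budget
    kept = []  # kept[i]: token indices of layer i, highest score first

    for layer, s in enumerate(scores, start=1):
        # Temporarily retain up to the per-layer maximum for this layer.
        kept.append(np.argsort(-s, kind="stable")[:layer_cap])

        # Periodically re-budget every layer processed so far.
        if layer % update_every == 0 or layer == len(scores):
            global_budget = avg_budget * layer
            pooled = np.concatenate([scores[i][kept[i]] for i in range(layer)])
            owner = np.concatenate(
                [np.full(len(kept[i]), i) for i in range(layer)]
            )
            # Global top-k over all retained scores, then count per layer.
            top = np.argsort(-pooled, kind="stable")[:global_budget]
            counts = np.bincount(owner[top], minlength=layer)
            # Each layer keeps only its share of the global top-k.
            kept = [kept[i][: counts[i]] for i in range(layer)]

    return kept


# Example with random scores; layer 0 is made artificially more important.
rng = np.random.default_rng(0)
demo_scores = [rng.random(100) for _ in range(4)]
demo_scores[0] = demo_scores[0] + 10.0
demo_kept = allocate_layer_budgets(demo_scores, avg_budget=10)
print([len(k) for k in demo_kept])
```

In this sketch a layer whose scores dominate can grow up to the per-layer cap, while other layers shrink so that the total stays within the global budget. How the scores are computed and how often the update runs are left open here.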
Peer Reviews
Decision: Submitted to ICLR 2025
a. The argument for a dynamic cache budget across different layers is sound and consistent with the findings of much recent literature. b. Comprehensive model and baseline coverage for the LongBench evaluation.
a. Judging by Eq. 1, the method does not seem to support FlashAttention. b. No proper efficiency evaluation. c. Dataset-wise, LongBench is too short to be used as the only long-context evaluation. Please consider adding coverage of InfiniteBench and RULER, with a more long-context-capable model such as Llama 3.1/3.2. d. I am interested in comparing DynamicKV with some newer head-based methods, such as Ada-KV and MInference. e. The main argument of the DynamicKV is different tasks might prefer a
- The experiment is solid - The problem of finding more efficient architectures for transformers is relevant and not saturated
- The Needle in a Haystack result is reported for only one model; evaluating more models would be better - For the Needle in a Haystack task, it might be better to test longer contexts with a model that supports a longer context window, such as InternLM-2.5-7B-Chat-1M or Llama-3-8B-Instruct-Gradient-1048k
- The paper tackles an important and timely problem. - Interesting observation that different tasks exhibit different KV cache patterns. - Promising accuracy results under high compression ratio.
- The technical novelty is limited. Dynamic layer-wise KV cache allocation has been explored before, such as https://arxiv.org/pdf/2406.13035 and https://arxiv.org/pdf/2405.14366. It would be better if the authors could discuss and compare the proposed method with existing ones. Also, many of the optimizations proposed by this work come from prior studies, e.g., pooling from SnapKV and layer-wise allocation from PyramidKV. - Significantly increased tuning cost. The method is not easy to use, it r
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Algorithms and Data Compression
Methods: Linear Layer · Multi-Head Attention · Residual Connection · Adam · Layer Normalization · Weight Decay · Softmax · WordPiece · Attention Dropout · Attention Is All You Need
