Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao

TL;DR
This paper presents an adaptive KV cache compression method for LLMs that reduces memory usage during inference by selectively discarding context tokens based on attention head profiling, without retraining.
Contribution
It introduces a novel adaptive KV cache construction technique guided by lightweight attention profiling, enabling memory-efficient inference without fine-tuning.
Findings
Significant GPU memory reduction during inference
Negligible impact on generation quality
Compatible with existing LLM inference pipelines
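To make the memory claim concrete, here is a rough back-of-the-envelope sketch of how a standard KV cache grows with context length. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 50% keep ratio are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of a standard (uncompressed) KV cache in bytes.

    The leading factor 2 counts one key vector and one value vector
    per token, per head, per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed, illustrative model shape (not from the paper).
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

# Hypothetical keep ratio: the fraction of cached tokens retained on average.
keep_ratio = 0.5
compressed = int(full * keep_ratio)

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"compressed cache: {compressed / 2**30:.2f} GiB")
```

Because the cache scales linearly with sequence length and batch size, any per-head reduction in retained tokens translates directly into a proportional memory saving for that head.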
Abstract
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our…
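The three cache policies named in the abstract (special tokens only, local context only, full cache) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `window` and `threshold` parameters, and the rule "pick the smallest policy that recovers enough attention mass" are assumptions made here for illustration, and profiling is simplified to the attention row of a single query position.

```python
import numpy as np


def candidate_masks(seq_len, special_idx, window):
    """Boolean keep-masks over key positions for each hypothetical policy."""
    special = np.zeros(seq_len, dtype=bool)
    special[list(special_idx)] = True          # keep only special tokens
    local = np.zeros(seq_len, dtype=bool)
    local[max(0, seq_len - window):] = True    # keep only the most recent tokens
    full = np.ones(seq_len, dtype=bool)        # standard KV cache
    return {"special": special, "local": local, "full": full}


def choose_policy(attn, special_idx, window=3, threshold=0.95):
    """Pick the smallest policy that recovers `threshold` of the attention mass.

    attn: 1-D attention weights of one head for one query position.
    Returns (policy_name, keep_mask).
    """
    attn = np.asarray(attn, dtype=float)
    masks = candidate_masks(attn.shape[0], special_idx, window)
    total = attn.sum()
    # Try policies from fewest retained tokens to most.
    for name, mask in sorted(masks.items(), key=lambda kv: kv[1].sum()):
        if attn[mask].sum() / total >= threshold:
            return name, mask
    return "full", masks["full"]


# Toy heads: position 0 plays the role of a special token (assumption).
local_head = [0.01, 0.01, 0.01, 0.01, 0.16, 0.30, 0.50]
special_head = [0.97, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005]
broad_head = np.full(7, 1 / 7)

for head in (local_head, special_head, broad_head):
    name, mask = choose_policy(head, special_idx=[0])
    print(name, int(mask.sum()), "tokens kept")
```

In a deployment along these lines, the chosen policy would be stored per attention head after profiling the prompt and then applied when appending new key and value vectors during decoding.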
Peer Reviews
Decision: ICLR 2024 oral
- The paper introduces valuable insights drawn from LLMs: (1) different attention heads exhibit different structures, and (2) the same head structures persist. These insights are well supported by empirical data and references to existing literature.
- The authors leverage these insights to devise an effective compression method that adapts to the structure of each attention head. The results show consistent improvements in compression rate and model quality over prior SoTA fixed compression mechanisms.
- The paper could benefit from presenting actual GPU inference performance results using FastGen and comparing them with other compression methods. Additionally, providing a runtime breakdown would offer more insight into the overhead caused by the profiling, compression, and decompression processes.
- It would be nice to look into the structure of KV in the multi-query attention design.
- This paper addresses a critical research problem in efficient LLM inference with advanced algorithm design. The designed algorithm is straightforward and effective.
- The presentation of the technical discussion is accurate and well-organized.
- The organization of the evaluation sections is clear, and the presented results demonstrate the advantages and efficiency of the proposed method.
- Based on my understanding, the proposed algorithm specializes in classic softmax-based attention. Is it possible to include a small section discussing the limitations of the proposed algorithm for more complicated attention mechanisms, along with some preliminary ideas about supporting those mechanisms in the future?
- Given the scale of the benchmarked model (llama-70B fp16 on A100-80G), a detail seems to be missing about the parallel strategies applied in the experiments.
- Having an adaptive KV cache for each attention module type is a really interesting and exciting idea.
- The absence of fine-tuning costs in the proposed method is commendable.
- The paper clearly positions itself within the body of existing literature by distinguishing the proposed method as an adaptive and diverse set of eviction strategies.
- The paper is clearly written, the presentation is great, and the concepts are easy to follow and digest.
- Although the idea of adaptive KV cache compression sounds interesting, what is the overhead of the book-keeping needed to support this adaptive and diverse behavior based on the type of attention? This is not discussed anywhere in the paper.
- That is, each layer id will be mapped to an eviction policy and deployed with the model at hand.
- Next, what is the added computational complexity, both asymptotically and experimentally?
- Table 3 shows an ablation on the policy order, why is this n
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
