In-context KV-Cache Eviction for LLMs via Attention-Gate
Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng

TL;DR
This paper introduces Attention-Gate, a lightweight module for dynamic KV-Cache eviction in large language models, reducing memory bottlenecks and improving inference efficiency without significant overhead.
Contribution
It proposes a novel Attention-Gate mechanism that enables dynamic, tunable eviction of tokens in KV-Cache during LLM inference, adaptable to pre-trained models.
Findings
Reduces memory usage during inference.
Improves model efficiency and performance.
Minimal additional computational overhead.
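To make the memory finding concrete, here is a back-of-the-envelope sketch of KV-Cache size for a single sequence. The formula (keys plus values, per layer, per head, per token) is standard; the `keep_ratio` parameter is a hypothetical stand-in for the fraction of tokens that survive eviction and is not a number reported by the paper. The shape used below is that of Llama-2-7B (32 layers, 32 heads, head dimension 128) with fp16 values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_value=2, keep_ratio=1.0):
    """Bytes needed to cache keys and values for one sequence.

    keep_ratio is a hypothetical fraction of tokens retained after eviction.
    """
    # The leading 2 counts one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return int(per_token * n_tokens * keep_ratio)


# Llama-2-7B-like shape with a 4096-token context, fp16 (2 bytes per value).
full = kv_cache_bytes(32, 32, 128, n_tokens=4096)
half = kv_cache_bytes(32, 32, 128, n_tokens=4096, keep_ratio=0.5)
print(full / 2**30, half / 2**30)  # sizes in GiB
```

Under these assumptions the cache costs 0.5 MiB per token, so memory grows linearly with context length and with batch size, and any eviction ratio translates directly into the same ratio of saved cache memory.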
Abstract
The KV-Cache technique has become the standard for the inference of large language models (LLMs). Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. It accepts the global context as input and yields eviction flags for each token. The self-attention modules in the model proceed according to the flags and cache only a subset of the KV states for next token prediction. The Attention-Gates can yield various flags for different heads and layers and be easily tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by Attention-Gates can be minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that…
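The flag-based eviction described in the abstract can be illustrated with a minimal sketch for a single attention head. Everything here is an assumption made for illustration: the gate is a simple sigmoid over a token term plus a mean-pooled context term, whereas the paper's Attention-Gate uses its own learned architecture; the names `eviction_flags`, `prune_kv`, `w_token`, `w_context`, and `threshold` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def eviction_flags(hidden, w_token, w_context, threshold=0.5):
    """Return one keep/evict flag per token for a single head.

    hidden: list of token vectors (the prompt context).
    w_token, w_context: hypothetical gate weights for this head.
    True means "keep this token's K/V", False means "evict".
    """
    dim = len(hidden[0])
    # Mean-pool the whole prompt so every flag depends on the global context.
    context = [sum(tok[i] for tok in hidden) / len(hidden) for i in range(dim)]
    bias = dot(w_context, context)
    return [sigmoid(dot(w_token, tok) + bias) >= threshold for tok in hidden]


def prune_kv(keys, values, flags):
    """Keep only the K/V entries whose flag is True."""
    kept = [(k, v) for k, v, keep in zip(keys, values, flags) if keep]
    return [k for k, _ in kept], [v for _, v in kept]


# Toy prompt of four tokens; K and V reuse the hidden states for brevity.
hidden = [[2.0, 0.0], [-3.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
flags = eviction_flags(hidden, w_token=[1.0, 0.0], w_context=[1.0, 0.0])
keys, values = prune_kv(hidden, hidden, flags)
print(flags)      # [True, False, True, False]
print(len(keys))  # 2
```

Calling `eviction_flags` with different weights per head and per layer would give each head its own retained subset, which is the flexibility the abstract attributes to Attention-Gates. In this toy setup the hard threshold is not differentiable, so a trainable version would need some relaxation; that detail is outside the scope of this sketch.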
Peer Reviews
Decision: Submitted to ICLR 2025
- The design choices of the proposed method are clearly motivated.
- Presentation is clear and easy to follow.
- Questionable choices of the evaluation benchmarks. The paper motivates the importance of KV-cache reduction by stating that “KV-cache can become a bottleneck when dealing with large models and long-context queries.” However, to the best of my knowledge, none of the benchmarks in this paper is “long-context,” nor are the models “large.” Besides, I am not sure benchmarks such as RTE and COPA are the best candidates for properly evaluating these models’ performance, compared to more “up-to-date”
- Attention-Gate is technically sound and easy to implement. The experiment showed its performance in both continued pre-training and lightweight fine-tuning settings with clear visualization of Attention Maps.
- Ablation studies are thoroughly conducted on the impact of the number/dimension of AG heads and the structure choice of AG.
- The baselines like Local, StreamingLLM, and H2O are weak. As the motivation of the Attention-Gate mechanism is to address the attention bias issue, some follow-up works like NACL and A2SF with the same motivation should be included as baselines to fully show the effectiveness of AG.
- The additional training phase in Attention-Gate may influence the performance of the original Llama2-7B model, so the performance of Llama2-7B with continued training without the AG mechanism should be reported.
1. Compared to traditional static or local dynamic eviction methods, Attention-Gate offers a flexible and adaptive approach to KV-Cache eviction in LLMs, enhancing cache management by adjusting to contextual needs.
2. Attention-Gate opts for the attention-like structure to collect contextual information, which can vary among attention heads/layers and be seamlessly plugged into pre-trained LLMs.
1. Limited architectural detail: The paper lacks a thorough explanation of the detailed design of Attention-Gate, particularly regarding its network architecture and computation equations.
2. Insufficient experimental analysis: The experiments would benefit from more comprehensive analysis to better illustrate the efficiency and effectiveness of Attention-Gate, providing a clearer understanding of its practical impact and performance in different scenarios.
Taxonomy
Topics: Network Packet Processing and Optimization · Advanced Data Storage Technologies · Network Traffic and Congestion Control
