In-context KV-Cache Eviction for LLMs via Attention-Gate

Zihao Zeng; Bokai Lin; Tianqi Hou; Hao Zhang; Zhijie Deng

arXiv:2410.12876·cs.CL·April 18, 2025

TL;DR

This paper introduces Attention-Gate, a lightweight module for dynamic KV-Cache eviction in large language models, reducing memory bottlenecks and improving inference efficiency without significant overhead.

Contribution

It proposes a novel Attention-Gate mechanism that enables dynamic, tunable eviction of tokens in KV-Cache during LLM inference, adaptable to pre-trained models.

Findings

1. Reduces memory usage during inference.
2. Improves model efficiency and performance.
3. Minimal additional computational overhead.

Abstract

The KV-Cache technique has become the standard for inference with large language models (LLMs). Yet KV-Cache is widely criticized as a potential bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. The module accepts the global context as input and yields an eviction flag for each token. The self-attention modules in the model proceed according to the flags and cache only a subset of the KV states for next-token prediction. The Attention-Gates can yield different flags for different heads and layers, and can be easily tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by Attention-Gates can be minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that…
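The flow described in the abstract (a gate reads the whole context, emits one flag per token, and attention then caches only the flagged subset of KV states) can be illustrated with a minimal single-head sketch. This is not the paper's implementation: the gate function `toy_gate_scores`, its weight vector, the 0.5 threshold, the rule of always keeping the latest token, and the reuse of hidden states as keys and values are all assumptions made for illustration. The sketch also uses keep flags, the inverse of the eviction flags the abstract mentions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_gate_scores(hidden, w):
    """Hypothetical gate: score each token against the mean of the whole
    sequence, so every score depends on the global context."""
    dim = len(hidden[0])
    mean = [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]
    scores = []
    for h in hidden:
        logit = sum(w[d] * h[d] * mean[d] for d in range(dim))
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return scores


def keep_flags(scores, threshold=0.5):
    """True means the token's KV state stays in the cache."""
    flags = [s >= threshold for s in scores]
    flags[-1] = True  # assumption: the most recent token is never evicted
    return flags


def evict(keys, values, flags):
    """Return only the key/value rows whose flag is True."""
    kept = [i for i, keep in enumerate(flags) if keep]
    return [keys[i] for i in kept], [values[i] for i in kept]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the retained cache."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(wt * v[d] for wt, v in zip(weights, values)) for d in range(dim)]


# Toy data: four tokens with two-dimensional hidden states.
hidden = [[1.0, 0.0], [-1.0, 0.5], [0.5, 1.0], [2.0, -1.0]]
gate_w = [1.0, 1.0]  # hypothetical gate weights

flags = keep_flags(toy_gate_scores(hidden, gate_w))
cached_k, cached_v = evict(hidden, hidden, flags)  # keys = values = hidden here
output = attend([1.0, 0.0], cached_k, cached_v)
print(flags, len(cached_k), output)
```

In a real model each layer and head would carry its own gate, so the retained subset could differ per head, as the abstract notes. The sketch collapses that to a single head to keep the control flow visible.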

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 5

Strengths

- The design choices of the proposed method are clearly motivated.
- Presentation is clear and easy-to-follow.

Weaknesses

- Questionable choices of the evaluation benchmarks. The paper motivates the importance of KV-cache reduction by stating that “KV-cache can become a bottleneck when dealing with large models and long-context queries.” However, to the best of my knowledge, none of the benchmarks in this paper is “long-context,” nor are the models “large.” Besides, I am not sure benchmarks such as RTE and COPA are the best candidates for properly evaluating these models’ performance, compared to more “up-to-date”

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Attention-Gate is technically sound and easy to implement. The experiment showed its performance in both continued pre-training and lightweight fine-tuning settings with clear visualization of Attention Maps.
- Ablation studies are thoroughly conducted on the impact of the number/dimension of AG heads and the structure choice of AG.

Weaknesses

- Baselines such as Local, StreamingLLM, and H2O are weak. Since the Attention-Gate mechanism is motivated by the attention bias issue, follow-up works with the same motivation, such as NACL and A2SF, should be included as baselines to fully show the effectiveness of AG.
- The additional training phase in Attention-Gate may influence the performance of the original Llama2-7B model, so the performance of Llama2-7B with continued training but without the AG mechanism should be reported.

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. Compared to traditional static or local dynamic eviction methods, Attention-Gate offers a flexible and adaptive approach to KV-Cache eviction in LLMs, enhancing cache management by adjusting to contextual needs.
2. Attention-Gate opts for the attention-like structure to collect contextual information, which can vary among attention heads/layers and be seamlessly plugged into pre-trained LLMs.

Weaknesses

1. Limited architectural detail: The paper lacks a thorough explanation of the detailed design of Attention-Gate, particularly regarding its network architecture and computation equations.
2. Insufficient experimental analysis: The experiments would benefit from more comprehensive analysis to better illustrate the efficiency and effectiveness of Attention-Gate, providing a clearer understanding of its practical impact and performance in different scenarios.

Taxonomy

Topics: Network Packet Processing and Optimization · Advanced Data Storage Technologies · Network Traffic and Congestion Control