G-KV: Decoding-Time KV Cache Eviction with Global Attention

Mengqi Liao; Lu Wang; Chaoyun Zhang; Zekai Shen; Xiaowei Mao; Si Qin; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Huaiyu Wan

arXiv:2512.00504·cs.CL·December 2, 2025

TL;DR

G-KV is a decoding-time KV cache eviction method for reasoning large language models. It ranks tokens by a global score that combines local and historical attention scores, which captures long-term token importance, and it uses post-training (reinforcement learning and distillation) to adapt models to compressed KV cache settings.

Contribution

The paper presents G-KV, a KV cache eviction strategy based on global attention scores, together with post-training methods (reinforcement learning and distillation) that optimize reasoning LLMs for compressed KV cache settings.

Findings

01. Enhanced token importance evaluation with global attention scores

02. Effective KV cache compression through the G-KV method

03. Improved reasoning efficiency in LLMs

Abstract

Recent reasoning large language models (LLMs) excel at complex tasks but face significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or on token eviction based on local attention scores, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to assess token importance more accurately. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code for this paper is available at https://github.com/microsoft/G-KV.
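
The global scoring idea in the abstract can be illustrated with a small sketch. The abstract does not state how local and historical scores are combined, so the weighted sum below, the weight `alpha`, and the function names `update_global_scores` and `select_tokens_to_keep` are all assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
from typing import List


def update_global_scores(historical: List[float], local: List[float],
                         alpha: float = 0.5) -> List[float]:
    """Blend historical and local attention scores per cached token.

    ASSUMPTION: a weighted sum with hypothetical weight `alpha`; the
    abstract only says the two kinds of score are combined.
    """
    if len(historical) != len(local):
        raise ValueError("score lists must cover the same cached tokens")
    return [alpha * h + (1.0 - alpha) * l for h, l in zip(historical, local)]


def select_tokens_to_keep(scores: List[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in order."""
    if budget >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Token 0 was attended to heavily earlier but is ignored by the current window.
historical = [0.9, 0.1, 0.5, 0.0]
local = [0.0, 0.2, 0.1, 0.7]

global_scores = update_global_scores(historical, local, alpha=0.5)
kept_global = select_tokens_to_keep(global_scores, budget=2)
kept_local = select_tokens_to_keep(local, budget=2)
```

With these toy numbers, eviction by local score alone keeps tokens 1 and 3 and drops token 0, while the blended score keeps tokens 0 and 3. This is the kind of long-term importance that a purely local criterion can overlook.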

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)