AttentionPredictor: Temporal Patterns Matter for KV Cache Compression

Qingyue Yang; Jie Wang; Xing Li; Zhihai Wang; Chen Chen; Lei Chen; Xianzhi Yu; Wulong Liu; Jianye Hao; Mingxuan Yuan; Bin Li

arXiv:2502.04077 · cs.CL · October 28, 2025

TL;DR

AttentionPredictor is a novel learning-based method that predicts attention patterns to improve KV cache compression and speed up long-context LLM inference without sacrificing performance.

Contribution

It introduces the first dynamic, learning-based approach for attention prediction in KV cache compression, outperforming static methods and enabling faster decoding.

Findings

01 Achieves 13× KV cache compression

02 Provides 5.6× speedup in decoding

03 Maintains comparable LLM performance

Abstract

With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through static modeling of attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the temporal patterns in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification. Specifically, AttentionPredictor learns a lightweight, unified convolution model to dynamically capture spatiotemporal patterns and predict the next-token attention scores. An appealing feature of…
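The idea in the abstract, predicting next-token attention from the recent history of attention scores and keeping only the tokens predicted to matter, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tiny hand-picked kernel (a stand-in for the learned convolution model), the input shapes, and the top-k selection rule are all assumptions made for illustration, since this page does not specify the architecture or training procedure.

```python
# Illustrative sketch only: a fixed 2D kernel stands in for the learned,
# lightweight convolution model mentioned in the abstract.

def predict_next_attention(history, kernel):
    """Predict next-step attention scores for each cached token.

    history: list of T rows (oldest first); each row holds N attention
             scores, one per cached token position.
    kernel:  T x K weights (K odd) mixing scores across time (rows) and
             neighbouring token positions (columns). Hypothetical values.
    """
    num_steps = len(history)
    num_tokens = len(history[0])
    width = len(kernel[0])
    half = width // 2
    predicted = []
    for j in range(num_tokens):
        acc = 0.0
        for t in range(num_steps):
            for k in range(width):
                idx = j + k - half
                if 0 <= idx < num_tokens:  # zero padding at the edges
                    acc += kernel[t][k] * history[t][idx]
        predicted.append(acc)
    return predicted


def select_critical_tokens(predicted, budget):
    """Return the positions of the `budget` highest-scoring tokens, in order."""
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    return sorted(ranked[:budget])


# Toy history: attention drifts from token 0 towards token 2 over three steps.
history = [
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.30, 0.10, 0.30, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.10],
]
# Recent steps get larger weights; side columns add a little spatial smoothing.
kernel = [
    [0.00, 0.10, 0.00],
    [0.05, 0.20, 0.05],
    [0.10, 0.40, 0.10],
]

predicted = predict_next_attention(history, kernel)
kept = select_critical_tokens(predicted, budget=2)
print(kept)
```

In this toy setting the KV entries at the positions in `kept` would be retained and the rest dropped, so a budget of 2 out of 6 tokens corresponds to 3× compression. A static method that scored tokens only by the oldest row would rank token 0 first, whereas weighting recent steps follows the shift towards token 2.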

Code & Models

Repositories

MIRALab-USTC/LLM-AttentionPredictor
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Convolution