Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Gergely Szilvasy (1), Manuel Faysse (1, 2), Maria Lomeli (1), Matthijs Douze (1), Pierre-Emmanuel Mazaré (1), Loïc Cabannes (1), Wen-tau Yih (1), Hervé Jégou (1) ((1) Meta FAIR, (2) MICS, CentraleSupélec)

arXiv:2605.14037 · cs.LG · May 15, 2026

TL;DR

This paper introduces SP-KV, a dynamic pruning method for transformer models that predicts the utility of key-value pairs to reduce memory and speed up decoding without sacrificing performance.

Contribution

The paper presents a novel end-to-end trainable utility predictor that enables dynamic, input-adaptive sparsification of KV caches in language models.

Findings

01. Reduces KV cache size by 3 to 10 times on average.

02. Maintains validation loss and downstream task performance.

03. Reveals structured sparsity patterns that could inform future architecture design.

Abstract

Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written to the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than…
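
The write rule described in the abstract can be illustrated with a minimal sketch. It assumes the utility scores are already given as plain floats; in the paper they come from a learned predictor trained jointly with the LLM, which is not reproduced here. The function names, the window size, the threshold value, and the strict `>` comparison are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List


def long_term_entries(utilities: List[float], threshold: float) -> List[int]:
    """Positions whose (assumed, precomputed) utility exceeds the threshold.

    Only these positions would be written to the long-term KV cache.
    """
    return [j for j, u in enumerate(utilities) if u > threshold]


def visible_positions(
    utilities: List[float], step: int, window: int, threshold: float
) -> List[int]:
    """Key positions that the query at `step` may attend to.

    A key at position j <= step is visible if it lies inside the local
    window (step - j < window) or if its utility passed the threshold.
    """
    visible = []
    for j in range(step + 1):  # causal: only keys at or before the query
        in_local_window = (step - j) < window
        kept_long_term = utilities[j] > threshold
        if in_local_window or kept_long_term:
            visible.append(j)
    return visible


# Hypothetical scores for a 6-token sequence.
utilities = [0.9, 0.1, 0.2, 0.8, 0.3, 0.05]

print(long_term_entries(utilities, threshold=0.5))
print(visible_positions(utilities, step=5, window=2, threshold=0.5))
```

With these made-up scores, the query at position 5 sees the two most recent positions through the local window plus the two older positions that passed the threshold; the remaining two positions are never stored in the long-term cache. The hard threshold shown here covers inference-time behavior only; how the decision is made trainable through the next-token loss is not described in the visible part of the abstract.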
