KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
Simon Jegou, Maximilian Jeblick

TL;DR
KVzap is a novel, fast, and adaptive method for pruning key-value caches in transformer models, significantly reducing cache size with minimal accuracy loss during inference.
Contribution
The paper introduces KVzap, a fast, input-adaptive approximation of KVzip for KV cache pruning in large language models that works in both prefilling and decoding.
Findings
Achieves 2-4x KV cache compression with negligible accuracy loss.
Outperforms existing methods on the KVpress leaderboard.
Evaluated on Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks.
Abstract
Growing context lengths in transformer-based language models have made the key-value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed-accuracy trade-offs. We introduce KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves 2-4x KV cache compression with negligible accuracy loss and achieves state-of-the-art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.
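To make the bottleneck concrete, the following back-of-the-envelope sketch estimates KV cache size for one sequence and what a 2-4x compression would save. It is not from the paper: the function name `kv_cache_bytes` is hypothetical, the configuration (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumed Llama-3.1-8B-style setup, and compression is treated as keeping a uniform 1/ratio of the KV pairs, whereas an input-adaptive method would vary the ratio per input.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to store keys and values for one sequence across all layers."""
    # Factor 2 accounts for storing both a key and a value per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Llama-3.1-8B-style config (illustrative, not taken from the paper):
# 32 layers, 8 KV heads (grouped-query attention), head_dim 128, 16-bit values.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

for ratio in (1, 2, 4):
    # Simplification: a ratio of r keeps 1/r of the KV pairs uniformly.
    print(f"{ratio}x compression: {full / ratio / 2**30:.1f} GiB")
```

Under these assumptions the uncompressed cache for a 131,072-token context is 16 GiB, so 2x and 4x compression would bring it down to roughly 8 GiB and 4 GiB per sequence.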
Code & Models
- 🤗 nvidia/KVzap-linear-Qwen3-8B (model) · 25 dl · ♡ 1
- 🤗 nvidia/KVzap-mlp-Qwen3-8B (model) · 349 dl · ♡ 3
- 🤗 nvidia/KVzap-mlp-Qwen3-32B (model) · 20 dl · ♡ 5
- 🤗 nvidia/KVzap-linear-Qwen3-32B (model) · 11 dl · ♡ 3
- 🤗 nvidia/KVzap-linear-Llama-3.1-8B-Instruct (model) · 194 dl
- 🤗 nvidia/KVzap-mlp-Llama-3.1-8B-Instruct (model) · 145 dl · ♡ 3
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications
