KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song

TL;DR
KVzip is a query-agnostic KV cache compression method for large language models. It evicts the less important key-value pairs without depending on any particular query, reducing memory and latency while maintaining performance across diverse tasks.
Contribution
Introduces KVzip, a query-agnostic cache eviction technique that compresses KV caches in LLMs, significantly reducing memory and latency while preserving task performance.
Findings
- Reduces KV cache size by 3-4 times
- Halves FlashAttention decoding latency
- Maintains performance across multiple tasks and models
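As a rough illustration of what a 3-4 times reduction means in memory terms, the sketch below estimates KV cache size from a model's shape. The configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit elements, a 131,072-token context) are illustrative assumptions rather than figures from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_bytes(full_bytes: int, compression_factor: float) -> float:
    """Cache size after evicting pairs so that 1/compression_factor remain."""
    if compression_factor < 1:
        raise ValueError("compression_factor must be >= 1")
    return full_bytes / compression_factor


# Assumed, illustrative configuration (not taken from the paper).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=131072)

GIB = 2 ** 30
print(f"full cache:      {full / GIB:.2f} GiB")
print(f"3x compression:  {compressed_bytes(full, 3) / GIB:.2f} GiB")
print(f"4x compression:  {compressed_bytes(full, 4) / GIB:.2f} GiB")
```

Under these assumed values the full cache is 16 GiB, so a 3-4 times reduction would bring it to roughly 4 to 5.3 GiB per sequence.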
Abstract
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context…
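The score-then-evict idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that a cached pair's importance is the maximum attention weight it receives while the model reproduces the context, and the toy attention matrix, the `keep_ratio` parameter, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def importance_from_attention(attn: Sequence[Sequence[float]]) -> List[float]:
    """Score each cached KV pair from a reconstruction pass.

    attn[q][k] is the attention weight that reconstruction query q places
    on cached key k. Assumption: importance is the maximum weight a key
    receives over all reconstruction queries.
    """
    num_keys = len(attn[0])
    return [max(row[k] for row in attn) for k in range(num_keys)]


def select_kept_indices(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring pairs, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def evict(keys: Sequence, values: Sequence, scores: Sequence[float],
          keep_ratio: float) -> Tuple[list, list, List[int]]:
    """Drop low-importance KV pairs; the result does not depend on a query."""
    kept = select_kept_indices(scores, keep_ratio)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy example: 2 reconstruction queries attending over 4 cached pairs.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.60, 0.30],
]
scores = importance_from_attention(attn)
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values, kept_idx = evict(keys, values, scores, keep_ratio=0.5)
print(kept_idx, kept_keys)
```

Because the scores come from reconstructing the context itself rather than from a downstream question, the same compressed cache can be reused for any later query, which is the property the abstract calls query-agnostic.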
Code & Models
- 🤗 nvidia/KVzap-linear-Qwen3-8B · model · 25 dl · ♡ 1
- 🤗 nvidia/KVzap-mlp-Qwen3-8B · model · 349 dl · ♡ 3
- 🤗 nvidia/KVzap-mlp-Qwen3-32B · model · 20 dl · ♡ 5
- 🤗 nvidia/KVzap-linear-Qwen3-32B · model · 11 dl · ♡ 3
- 🤗 nvidia/KVzap-linear-Llama-3.1-8B-Instruct · model · 194 dl
- 🤗 nvidia/KVzap-mlp-Llama-3.1-8B-Instruct · model · 145 dl · ♡ 3
Taxonomy
Methods: Softmax · Attention Is All You Need
