KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Jang-Hyun Kim; Jinuk Kim; Sangwoo Kwon; Jae W. Lee; Sangdoo Yun; Hyun Oh Song

arXiv:2505.23416 · cs.DB · October 1, 2025

TL;DR

KVzip is a KV cache compression method for large language models. It reduces memory use and attention latency by evicting less important key-value pairs independently of any particular query, so one compressed cache can be reused across queries while maintaining performance across diverse tasks.

Contribution

Introduces KVzip, a query-agnostic KV cache eviction technique for LLMs that reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2× with negligible loss in task performance.

Findings

01 Reduces KV cache size by 3-4×
02 Cuts FlashAttention decoding latency by approximately 2×
03 Maintains performance across multiple tasks and models
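
To put the 3-4× reduction in concrete terms, the following is a rough back-of-the-envelope sketch of KV cache memory for a single sequence. The layer count, KV head count, head dimension, context length, and 16-bit precision are illustrative assumptions, not figures taken from this page; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

# Assumed (illustrative) configuration: 32 layers, 8 KV heads,
# head dimension 128, 131072-token context, 16-bit values.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

# Cache size in GiB when uncompressed (ratio 1) and at 3x and 4x compression.
sizes_gib = {ratio: full / ratio / GIB for ratio in (1, 3, 4)}

for ratio, size in sizes_gib.items():
    print(f"{ratio}x compression: {size:.2f} GiB")
```

Under these assumptions the uncompressed cache is 16 GiB per sequence, so a 3-4× reduction would bring it to roughly 4 to 5.3 GiB.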

Abstract

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context…
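
The score-then-evict idea in the abstract can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes attention weights from a context-reconstruction pass are already available as a matrix, and it assumes each cached pair is scored by the maximum attention it receives (the abstract does not specify the aggregation). The names `importance_scores` and `evict` and the toy numbers are hypothetical.

```python
def importance_scores(attn):
    """Score each cached KV pair from reconstruction-pass attention.

    attn[q][k] is the attention weight that reconstruction query position q
    places on cached KV pair k. Assumption: a pair's importance is the
    maximum attention it receives from any query position.
    """
    num_kv = len(attn[0])
    return [max(row[k] for row in attn) for k in range(num_kv)]


def evict(scores, keep_ratio):
    """Return the sorted indices of the KV pairs to keep."""
    num_keep = max(1, int(len(scores) * keep_ratio))
    # Rank pairs from most to least important; ties keep original order.
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:num_keep])


# Toy attention matrix (made-up numbers): 3 query positions, 6 cached pairs.
attn = [
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.10, 0.15, 0.05],
    [0.10, 0.05, 0.05, 0.05, 0.70, 0.05],
]

scores = importance_scores(attn)
kept = evict(scores, keep_ratio=0.5)
print(kept)  # indices of the retained KV pairs
```

Because the scores depend only on the context and not on a downstream question, the retained set can in principle be computed once and reused for any later query, which is the query-agnostic property the abstract describes.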

Code & Models

Repositories

snu-mllab/kvzip
PyTorch · Official

Datasets

Jang-Hyun/SCBench-preprocessed
dataset · 2.5k downloads

Videos

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction · SlidesLive

Taxonomy

Methods: Softmax · Attention Is All You Need