Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Jang-Hyun Kim; Dongyoon Han; Sangdoo Yun

arXiv:2601.17668·cs.LG·February 10, 2026

TL;DR

Fast KVzip introduces a gating-based KV cache eviction method for LLMs that achieves high compression with minimal performance loss, enabling efficient inference across diverse tasks.

Contribution

The paper presents a novel, lightweight gating mechanism for KV cache eviction in frozen LLMs that maintains performance while significantly reducing cache size.

Findings

01. Up to 70% KV cache eviction with negligible performance loss.

02. Effective across multiple LLMs and tasks including reasoning and code comprehension.

03. Seamless integration into existing LLM inference pipelines.

Abstract

Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance…
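The eviction step described in the abstract can be illustrated with a minimal sketch. It assumes each cached KV pair already has a scalar importance score; in the paper these would come from the trained gating modules, whose architecture is not given on this page, so the `scores` list here is only a stand-in. The function name `evict_kv` and the `keep_ratio` parameter are hypothetical and not taken from the paper.

```python
import math


def evict_kv(keys, values, scores, keep_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order.

    `scores` is a hypothetical stand-in for per-token gate outputs.
    Returns the retained keys, retained values, and retained positions.
    """
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values, and scores must have equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of pairs to retain (at least one).
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))

    # Rank positions by score, highest first, then restore positional order
    # so the retained cache keeps the original token sequence.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache of six tokens; evict half of them.
keys = ["k0", "k1", "k2", "k3", "k4", "k5"]
values = ["v0", "v1", "v2", "v3", "v4", "v5"]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
kept_keys, kept_values, kept_positions = evict_kv(keys, values, scores, 0.5)
```

In this sketch, the 70% eviction level reported in the findings would correspond to `keep_ratio=0.3`. A real implementation would operate on per-layer, per-head tensors rather than Python lists.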

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy