KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

Aomufei Yuan; Zhiming Wang; Ruijie Miao; Dayu Wang; Yuxuan Tian; Zihan Wang; Yebo Peng; Yuhan Wu; Bairen Yi; Xin Liu; Tong Yang

arXiv:2512.17917 · cs.CL · December 23, 2025

Open Access

TL;DR

KVReviver is a reversible KV cache compression technique for large language models. It reduces KV cache memory while preserving inference accuracy, which makes longer contexts practical under a fixed memory budget.

Contribution

The paper presents KVReviver, a sketch-based reversible KV cache compression method. Instead of permanently evicting or merging tokens, it keeps compressed tokens recoverable, which avoids unrecoverable information loss and improves memory efficiency in LLMs.

Findings

01. Requires only 10% of the KV cache budget for 2k-length contexts with no accuracy loss.

02. Incurs about 2% accuracy loss with a 25% KV cache budget for 32k-length contexts.

03. Enables full-scale computation within limited memory.
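
To put these cache budgets in concrete terms, the following is a rough back-of-the-envelope sketch of KV cache size. The model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical defaults chosen for illustration and do not come from the paper; the helper names `kv_cache_bytes` and `budget_bytes` are also made up here. Whether the paper counts the sketch structure inside the stated budget is not specified in this summary.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    All defaults are hypothetical (a 7B-class model in fp16), not from the paper.
    """
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def budget_bytes(seq_len, fraction, **model):
    """Bytes available when only `fraction` of the full cache is retained."""
    return int(kv_cache_bytes(seq_len, **model) * fraction)


GIB = 2 ** 30
# The (context length, budget fraction) pairs mirror the findings listed above.
for seq_len, fraction in [(2048, 0.10), (32768, 0.25)]:
    full = kv_cache_bytes(seq_len)
    kept = budget_bytes(seq_len, fraction)
    print(f"{seq_len} tokens: full {full / GIB:.2f} GiB, budget {kept / GIB:.2f} GiB")
```

Under these assumed dimensions the full cache is 1 GiB per sequence at 2k tokens and 16 GiB at 32k tokens, so a 25% budget at 32k corresponds to 4 GiB.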

Abstract

As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining…
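
The abstract's idea of reconstructing compressed tokens from an additional data structure can be illustrated with a standard Count Sketch whose buckets hold vectors. This is only a toy under stated assumptions: it is not KVReviver's actual data structure, the class name `VectorCountSketch` and all parameters are hypothetical, and explicit random lookup tables stand in for real hash functions. Each token's vector is added with a random sign into one bucket per row, and an estimate is read back as the per-coordinate median across rows.

```python
import random
from statistics import median


class VectorCountSketch:
    """Toy Count Sketch whose buckets hold vectors instead of scalar counters."""

    def __init__(self, depth, width, dim, max_tokens, seed=0):
        rng = random.Random(seed)
        self.depth, self.width, self.dim = depth, width, dim
        # Explicit per-token lookup tables stand in for hash functions (toy only).
        self.bucket = [[rng.randrange(width) for _ in range(max_tokens)]
                       for _ in range(depth)]
        self.sign = [[rng.choice((-1.0, 1.0)) for _ in range(max_tokens)]
                     for _ in range(depth)]
        # depth x width buckets, each a dim-sized accumulator.
        self.table = [[[0.0] * dim for _ in range(width)] for _ in range(depth)]

    def add(self, token_id, vec):
        """Fold a token's vector into one signed bucket per row."""
        for r in range(self.depth):
            cell = self.table[r][self.bucket[r][token_id]]
            s = self.sign[r][token_id]
            for j, x in enumerate(vec):
                cell[j] += s * x

    def estimate(self, token_id):
        """Reconstruct a token's vector as the per-coordinate median over rows."""
        rows = []
        for r in range(self.depth):
            cell = self.table[r][self.bucket[r][token_id]]
            s = self.sign[r][token_id]
            rows.append([s * x for x in cell])
        return [median(col) for col in zip(*rows)]


sketch = VectorCountSketch(depth=5, width=64, dim=4, max_tokens=1024)
sketch.add(7, [0.5, -1.0, 2.0, 0.25])
recovered = sketch.estimate(7)
print(recovered)
```

With a single stored token the estimate equals the original vector exactly. Once many tokens share buckets, the estimate becomes approximate, and the error depends on `depth`, `width`, and the magnitude of the colliding vectors; how KVReviver controls that error is not covered by this summary.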

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Natural Language Processing Techniques