R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai; Wen Xiao; Hanshi Sun; Cheng Luo; Yikai Zhang; Ke Wan; Yucheng Li; Yeyang Zhou; Li-Wen Chang; Jiuxiang Gu; Zhen Dong; Anima Anandkumar; Abedelkadir Asi; Junjie Hu

arXiv:2505.24133·cs.CL·January 23, 2026

TL;DR

R-KV is a redundancy-aware KV cache compression method for reasoning models. It preserves nearly full-cache performance with a much smaller cache, which reduces memory usage and improves inference throughput.

Contribution

The paper introduces R-KV, a KV cache compression approach that targets redundant tokens and outperforms existing compression baselines in reasoning-model inference.

Findings

01

Achieves nearly 100% of full KV cache performance with only 10% of the KV cache.

02

Outperforms existing KV cache compression baselines, which reach only 60% of full-cache performance, and reaches 105% of full-cache performance with 16% of the KV cache.

03

Reduces memory usage by 90% and increases throughput by 6.6 times.
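To make the memory figure concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with the number of retained tokens. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 16,384-token sequence are hypothetical values chosen for illustration, not the configuration used in the paper, and the sketch assumes cache memory grows linearly with the tokens kept.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: one key and one value vector
    per token, per KV head, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens * batch_size


# Hypothetical model shape and sequence length (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 8, 128, 16_384

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
# Keep only 10% of the tokens, matching the 10% cache budget quoted above.
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, int(SEQ_LEN * 0.10))
saving = 1 - compressed / full

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"compressed cache: {compressed / 2**30:.2f} GiB")
print(f"memory saved:     {saving:.1%}")
```

Under these assumptions, a 10% token budget corresponds to roughly a 90% reduction in cache memory. The throughput gain depends on batch size and hardware, so it is not modeled here.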

Abstract

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This…
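The abstract does not spell out how redundant tokens are identified, so the following is only a minimal sketch of what redundancy-aware eviction could look like, not the authors' implementation. It assumes that redundancy is measured as the cosine similarity between a token's key vector and any later key vector, and that this is traded off against a per-token importance score through a hypothetical mixing weight `lam`. The function names, the scoring rule, and the toy data are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def redundancy_scores(keys):
    """For each token, the highest similarity to any LATER token.
    Comparing only against later tokens means the most recent copy
    of a near-duplicate is the one that stays."""
    scores = []
    for i, key in enumerate(keys):
        later = [cosine(key, other) for other in keys[i + 1:]]
        scores.append(max(later) if later else 0.0)
    return scores


def select_tokens(keys, importance, budget, lam=0.5):
    """Return the indices of the `budget` tokens to keep.
    Assumed scoring rule: reward importance, penalize redundancy."""
    redundancy = redundancy_scores(keys)
    combined = [lam * imp - (1 - lam) * red
                for imp, red in zip(importance, redundancy)]
    ranked = sorted(range(len(keys)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data (hypothetical): token 0 is a near-duplicate of token 1.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 1.0]]
importance = [0.3, 0.3, 0.2, 0.2]

kept = select_tokens(keys, importance, budget=3)
print(kept)
```

In this toy case tokens 0 and 1 have equal importance, but token 0 is almost identical to the later token 1, so it receives the largest redundancy penalty and is the one evicted. A purely importance-based policy would have no reason to prefer dropping it.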

Code & Models

Repositories

zefan-cai/r-kv (official)

Taxonomy

Topics: Semantic Web and Ontologies · Bayesian Modeling and Causal Inference