SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian; Seyedarmin Azizi; Yequan Zhao; Erfan Baghaei Potraghloo; Sean McPherson; Sharath Nittur Sridhar; Zhengyang Wang; Zheng Zhang; Massoud Pedram; Souvik Kundu

arXiv:2512.07993·cs.AI·April 21, 2026

TL;DR

SkipKV is a training-free method that improves the inference efficiency of large reasoning models by making KV eviction and generation decisions at the sentence level rather than per token, reducing both KV cache size and response length.

Contribution

The paper introduces a novel sentence-scoring metric and a dynamic steering vector that together enable efficient, semantic-aware KV compression without retraining.

Findings

01 Achieves up to 26.7% higher accuracy with compression

02 Yields up to 1.6x shorter generation length

03 Improves throughput by up to 1.7x

Abstract

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the length of their verbose chain-of-thought (CoT) reasoning. This growth causes both memory overhead and throughput bottlenecks, limiting efficient deployment. To reduce KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that state-of-the-art (SoTA) eviction methods fail to maintain accuracy in multi-batch settings, owing to unstable token-wise scoring and a reduced effective KV budget caused by padding. Additionally, these methods often generate longer sequences than the original model without eviction, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method…
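The two cost effects described in the abstract, linear KV cache growth and the loss of usable budget to padding, can be made concrete with a small back-of-the-envelope sketch. Everything here is an illustrative assumption rather than anything taken from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is hypothetical, and `effective_budget` uses a deliberately simplified worst-case model in which every padded position of a sequence occupies one slot of that sequence's KV budget.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    The result is linear in seq_len, which is why long CoT traces are costly.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def effective_budget(budget, seq_lens):
    """Usable KV slots per sequence in a padded batch.

    Simplified assumption (not from the paper): each sequence is padded to
    the longest one in the batch, and every pad position consumes one slot
    of that sequence's budget.
    """
    max_len = max(seq_lens)
    return [max(0, budget - (max_len - n)) for n in seq_lens]


# Hypothetical model shape, chosen only for round numbers.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_trace = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token)           # bytes of KV cache added per generated token
print(full_trace / 2**30)  # GiB for a single 8192-token reasoning trace

# Three sequences of unequal length sharing a 1024-slot budget each.
print(effective_budget(1024, [4000, 3500, 3900]))
```

Under these assumptions the shortest sequence in the batch keeps only about half of its nominal budget for real tokens, which illustrates one plausible way that padding can erode accuracy in multi-batch eviction.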

Code & Models

Repositories

TTTTTTris/SkipKV (GitHub)
