ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Kaiwen Chen; Xin Tan; Minchen Yu; Jingzong Li; Hong Xu

arXiv:2507.21433·cs.LG·May 15, 2026

TL;DR

ReasonCache is a KV cache management approach that exploits the similarity of intermediate reasoning steps to raise throughput and reduce latency when serving large reasoning models.

Contribution

ReasonCache introduces a collaborative filtering-based method that identifies reusable KV cache blocks and reuses them without copying, improving inference QoS without sacrificing accuracy.

Findings

01. Achieves up to 89.2% peak throughput improvement.

02. Realizes 40-60% average throughput gains.

03. Maintains higher accuracy compared to existing cache techniques.

Abstract

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation…
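
To make the idea of similarity-driven, zero-copy block reuse concrete, here is a minimal sketch. It is not the paper's algorithm: a plain cosine-similarity lookup stands in for the Collaborative Filtering Algorithm, and the names `SharedBlockPool`, `acquire`, `release`, the signature vectors, and the 0.95 threshold are all hypothetical. "Zero-copy" is modeled as two sequences holding the same block id with a reference count, rather than duplicating the block.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SharedBlockPool:
    """Hypothetical pool: blocks are keyed by a signature vector and shared by reference."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed similarity cutoff, not from the paper
        self.signatures = {}        # block id -> signature of the reasoning step
        self.refcount = {}          # block id -> number of sequences using it
        self.next_id = 0

    def acquire(self, signature):
        """Return (block_id, reused). Reuse the most similar block at or above the threshold."""
        best_id, best_sim = None, self.threshold
        for bid, sig in self.signatures.items():
            sim = cosine(signature, sig)
            if sim >= best_sim:
                best_id, best_sim = bid, sim
        if best_id is not None:
            self.refcount[best_id] += 1  # share the existing block, no copy
            return best_id, True
        bid = self.next_id               # no match: allocate a fresh block
        self.next_id += 1
        self.signatures[bid] = list(signature)
        self.refcount[bid] = 1
        return bid, False

    def release(self, block_id):
        """Drop one reference; free the block when no sequence uses it."""
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]
            del self.signatures[block_id]


# Two sequences whose first reasoning steps are nearly identical.
pool = SharedBlockPool(threshold=0.95)
seq_a = [pool.acquire([1.0, 0.0, 0.0])[0], pool.acquire([0.0, 1.0, 0.0])[0]]
seq_b = [pool.acquire([0.99, 0.05, 0.0])[0], pool.acquire([0.0, 0.0, 1.0])[0]]
```

In this toy run, both sequences end up pointing at the same first block while keeping separate second blocks, so memory for the shared step is held once. A real serving system would need to decide how signatures are computed and how approximate reuse affects output quality; the sketch leaves both open.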
