KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Zhiqing Zhong; Zhijing Ye; Jian Zhang; Weijian Zheng; Bolun Sun; Xiaodong Yu

arXiv:2605.09735·cs.AR·May 12, 2026

TL;DR

KV-RM introduces a runtime design that regularizes KV-cache movement beneath static-graph LLM decoders, improving throughput, latency, and memory efficiency by decoupling logical histories from physical storage.

Contribution

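KV-RM proposes a KV-cache management approach that absorbs online-decoding variability (differing request lengths, asynchronous EOS events, and fragmenting logical histories) below the fixed decode interface of a static-graph LLM decoder, to improve static-graph serving performance.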

Findings

01. Improves decoding throughput and tail latency on NVIDIA A100 GPUs.

02. Reduces reserved KV memory across different workloads.

03. Eliminates burst-time latency spikes in production-trace replay.

Abstract

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV…
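
The idea of decoupling logical KV histories from physical storage can be illustrated with a toy block pager. This is a minimal sketch under stated assumptions, not the paper's implementation: the class `BlockPager`, the method `commit_descriptor`, and the constants `BLOCK_TOKENS`, `MAX_BATCH`, and `MAX_BLOCKS_PER_SEQ` are all hypothetical names chosen for illustration. It shows how per-request block tables can grow and shrink freely while each decode step still receives one descriptor of fixed shape, which is what a static-graph decoder would need.

```python
from dataclasses import dataclass, field

# All constants and names below are illustrative assumptions, not values from the paper.
BLOCK_TOKENS = 16        # tokens stored per physical KV block
MAX_BATCH = 4            # fixed number of batch slots seen by the static graph
MAX_BLOCKS_PER_SEQ = 8   # fixed width of each block-table row
PAD = -1                 # marks unused block-table entries


@dataclass
class BlockPager:
    """Toy pager: maps each request's logical KV history to physical block ids."""
    num_blocks: int
    free: list = field(default_factory=list)
    tables: dict = field(default_factory=dict)    # request id -> list of block ids
    lengths: dict = field(default_factory=dict)   # request id -> token count

    def __post_init__(self):
        # Stack of free physical blocks; pop() hands out the lowest id first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def append_token(self, rid):
        """Record one more token for a request, allocating a block when needed."""
        n = self.lengths.get(rid, 0)
        table = self.tables.get(rid, [])
        if n % BLOCK_TOKENS == 0:  # the current block is full, or this is the first token
            if len(table) >= MAX_BLOCKS_PER_SEQ:
                raise ValueError("sequence exceeds the fixed block-table width")
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.tables[rid] = table
        self.lengths[rid] = n + 1

    def release(self, rid):
        """Return a finished request's blocks (e.g. after EOS) to the free stack."""
        self.free.extend(reversed(self.tables.pop(rid, [])))
        self.lengths.pop(rid, None)

    def commit_descriptor(self, slots):
        """Build one fixed-shape (block_table, seq_lens) pair for a decode step.

        `slots` holds MAX_BATCH entries: a request id, or None for an idle slot.
        """
        if len(slots) != MAX_BATCH:
            raise ValueError("slots must have exactly MAX_BATCH entries")
        block_table, seq_lens = [], []
        for rid in slots:
            blocks = self.tables.get(rid, []) if rid is not None else []
            block_table.append(blocks + [PAD] * (MAX_BLOCKS_PER_SEQ - len(blocks)))
            seq_lens.append(self.lengths.get(rid, 0) if rid is not None else 0)
        return block_table, seq_lens
```

In this sketch, a request that finishes early frees its blocks for immediate reuse by another request, yet the descriptor handed to the decoder always has `MAX_BATCH` rows of `MAX_BLOCKS_PER_SEQ` entries. Padding idle slots rather than reshaping the batch is the design choice that keeps the decode interface fixed.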
