KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu

TL;DR
KV-RM introduces a runtime design that regularizes KV-cache movement beneath static-graph LLM decoders, improving throughput, latency, and memory efficiency by decoupling logical histories from physical storage.
Contribution
KV-RM proposes a KV-cache management approach that absorbs request-level variability (differing request lengths, asynchronous EOS events, fragmented logical histories) below the fixed decode interface, improving static-graph LLM serving performance.
Findings
Improves decoding throughput and tail latency on NVIDIA A100 GPUs.
Reduces reserved KV memory across different workloads.
Eliminates burst-time latency spikes in production-trace replay.
Abstract
Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV…
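To make the abstract's vocabulary concrete, here is a minimal Python sketch of the general idea of decoupling logical KV histories from physical blocks. It is not the authors' implementation. Every name in it (`BlockPager`, `BLOCK_TOKENS`, `descriptor`, `coalesce`) is hypothetical, and the free-list allocation policy and the `-1` padding convention are assumptions chosen only for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field

BLOCK_TOKENS = 4  # hypothetical block size: tokens stored per physical block


@dataclass
class BlockPager:
    """Toy pager: maps each request's logical history to physical block ids."""
    num_blocks: int
    free: list[int] = field(init=False)
    tables: dict[int, list[int]] = field(default_factory=dict)
    lengths: dict[int, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Stack of free physical blocks; block 0 is handed out first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def append_token(self, req_id: int) -> None:
        """Grow a request by one token, allocating a block at block boundaries."""
        n = self.lengths.get(req_id, 0)
        table = self.tables.setdefault(req_id, [])
        if n % BLOCK_TOKENS == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: int) -> None:
        """Return a finished request's blocks to the pool (e.g. after EOS)."""
        self.free.extend(reversed(self.tables.pop(req_id, [])))
        self.lengths.pop(req_id, None)

    def descriptor(self, slots: list[int | None], max_blocks: int) -> list[list[int]]:
        """Fixed-shape block table for one decode step; unused entries are -1."""
        rows = []
        for req_id in slots:
            table = self.tables.get(req_id, []) if req_id is not None else []
            if len(table) > max_blocks:
                raise ValueError("history exceeds the descriptor's fixed width")
            rows.append(table + [-1] * (max_blocks - len(table)))
        return rows


def coalesce(blocks: list[int]) -> list[tuple[int, int]]:
    """Merge adjacent physical block ids into (start, count) runs."""
    runs: list[tuple[int, int]] = []
    for b in blocks:
        if runs and runs[-1][0] + runs[-1][1] == b:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs
```

In this sketch the descriptor always has the same shape (`len(slots)` rows by `max_blocks` columns) regardless of how many requests are live or how long they are, which is the property a fixed decode interface needs. `coalesce` shows one simple way that scattered blocks could be grouped into contiguous runs before being moved. The paper's actual merge-staged transport path may work quite differently.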
