KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Zedong Liu; Xinyang Ma; Dejun Luo; Hairui Zhao; Bing Lu; Wenjing Huang; Yida Gu; Xingchen Liu; Zheng Wei; Jinyang Liu; Dingwen Tao; Guangming Tan

arXiv:2605.13734·cs.DC·May 14, 2026

TL;DR

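KVServe is an adaptive, service-aware framework for KV cache compression in disaggregated LLM serving. It selects compression strategies dynamically as workload mix, bandwidth, and SLO/quality budgets change, and reports up to 9.13× JCT speedup in PD-separated serving and up to 32.8× lower latency in KV-disaggregated serving.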

Contribution

KVServe introduces a unified, modular compression strategy space, an efficient Bayesian profiling engine, and an online controller that selects profiles in real time during LLM serving.

Findings

01 Achieves up to 9.13× JCT speedup in PD-separated serving.

02 Reduces KV-disaggregated serving latency by up to 32.8×.

03 Demonstrates effectiveness across datasets, models, GPUs, and networks.

Abstract

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload that crosses network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically deployed as static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this…
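The claim that a fixed compression choice can increase latency can be illustrated with a toy cost model. The sketch below is not KVServe's controller or its profiling engine. It only assumes that moving a KV payload costs codec time plus wire time, and it picks the cheapest profile that fits a quality budget. The names `Profile`, `transfer_seconds`, and `select_profile`, as well as every number in `PROFILES`, are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A hypothetical compression profile (all values are illustrative)."""
    name: str
    ratio: float            # original size / compressed size
    codec_gb_per_s: float   # combined compress + decompress throughput
    quality_loss: float     # assumed accuracy drop, 0.0 means lossless


# Made-up profiles; "none" sends the KV cache uncompressed.
PROFILES = [
    Profile("none", ratio=1.0, codec_gb_per_s=math.inf, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_gb_per_s=20.0, quality_loss=0.0),
    Profile("heavy", ratio=8.0, codec_gb_per_s=4.0, quality_loss=0.01),
]


def transfer_seconds(profile: Profile, kv_gb: float, link_gb_per_s: float) -> float:
    """Codec time plus wire time for one KV payload under a simple additive model."""
    codec_time = kv_gb / profile.codec_gb_per_s      # 0.0 when throughput is inf
    wire_time = kv_gb / profile.ratio / link_gb_per_s
    return codec_time + wire_time


def select_profile(profiles, kv_gb, link_gb_per_s, max_quality_loss):
    """Pick the fastest profile whose quality loss fits the current budget."""
    allowed = [p for p in profiles if p.quality_loss <= max_quality_loss]
    return min(allowed, key=lambda p: transfer_seconds(p, kv_gb, link_gb_per_s))


if __name__ == "__main__":
    for link in (1.0, 50.0):
        best = select_profile(PROFILES, kv_gb=2.0, link_gb_per_s=link,
                              max_quality_loss=0.02)
        print(f"link={link} GB/s -> {best.name}")
```

Under these made-up numbers, the aggressive profile wins on a slow link, while on a fast link its codec time outweighs the bytes saved and sending uncompressed is cheaper. That is the situation in which a static configuration hurts rather than helps.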

Code & Models

Repositories

hpdps-group/KVServe (GitHub)
