TL;DR
KVServe is an adaptive, service-aware framework that optimizes KV cache compression in disaggregated LLM serving, significantly reducing latency and improving throughput by dynamically selecting compression strategies.
Contribution
It introduces a unified, modular compression strategy space, an efficient Bayesian profiling engine, and an online controller for real-time profile selection in LLM serving.
Findings
Achieves up to 9.13× job completion time (JCT) speedup in prefill-decode (PD) separated serving.
Reduces latency in KV-disaggregated serving by up to 32.8×.
Demonstrates effectiveness across datasets, models, GPUs, and networks.
Abstract
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this…
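
The abstract's point that a fixed compression choice can raise latency can be illustrated with a toy cost model. The sketch below is not KVServe's controller or profiling engine. The `Profile` fields, the `estimated_latency` formula, and every number in `PROFILES` are hypothetical values chosen only to show how the best choice shifts with bandwidth and quality budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile; all fields are illustrative assumptions."""
    name: str
    ratio: float           # raw KV size / compressed KV size
    codec_s_per_gb: float  # compress + decompress time per GB of raw KV
    quality_loss: float    # made-up quality penalty score


def estimated_latency(p: Profile, kv_gb: float, bandwidth_gbps: float) -> float:
    """Toy model: time to move the compressed KV plus codec overhead, in seconds."""
    transfer_s = (kv_gb / p.ratio) * 8.0 / bandwidth_gbps  # GB -> Gb, then / Gbps
    codec_s = kv_gb * p.codec_s_per_gb
    return transfer_s + codec_s


def select_profile(profiles, kv_gb, bandwidth_gbps, quality_budget):
    """Pick the lowest-latency profile whose quality loss fits the budget."""
    feasible = [p for p in profiles if p.quality_loss <= quality_budget]
    if not feasible:
        raise ValueError("no profile satisfies the quality budget")
    return min(feasible, key=lambda p: estimated_latency(p, kv_gb, bandwidth_gbps))


# Invented numbers, not measurements from the paper.
PROFILES = [
    Profile("none", ratio=1.0, codec_s_per_gb=0.0, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_s_per_gb=0.05, quality_loss=0.01),
    Profile("aggressive", ratio=8.0, codec_s_per_gb=0.30, quality_loss=0.05),
]

if __name__ == "__main__":
    for bw in (100.0, 10.0, 1.0):
        best = select_profile(PROFILES, kv_gb=1.0, bandwidth_gbps=bw, quality_budget=0.05)
        print(f"{bw:>6.1f} Gbps -> {best.name}")
```

Under these assumed numbers, sending 1 GB of KV uncompressed is fastest on a 100 Gbps link because codec overhead outweighs the transfer savings, while the aggressive profile wins at 10 Gbps. Tightening the quality budget to 0.02 at 10 Gbps shifts the choice to the light profile, which is the kind of context dependence a static configuration cannot track.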
