Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate
Fangyue Liu, Hua Liu, Xinyuan Lyu, Shuo Ai, Hao Liang, Lingpeng Chen, Ziqian Hu, Chong Zha, Xin Jin, Hanmei Luo, Peng Chen

TL;DR
Valve is a GPU runtime system that enables efficient online-offline inference colocation with bounded preemption latency and rate, significantly improving resource utilization in production environments.
Contribution
Valve introduces a practical GPU runtime that guarantees bounded preemption latency and rate, requiring minimal framework and driver modifications and enabling high utilization in production.
Findings
Improves cluster utilization by 34.6% in production.
Achieves sub-millisecond preemption with minimal online interference.
Reduces GPU requirements, saving 2,170 GPUs.
Abstract
LLM inference now powers latency-critical production services. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) extensive framework and driver modifications to colocate different models and support preemption. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and…
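As a reading aid for the two rate bounds named in the abstract, here is a toy Python sketch. It is not Valve's runtime or API. The class names, the token-bucket policy, and the `rate` and `burst` parameters are all hypothetical, chosen only to show what "at most once per online request" and "rate-limited" could mean as invariants.

```python
class TokenBucket:
    """Hypothetical limiter: at most `burst` events at once, `rate` per second on average."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens refilled per second (assumed parameter)
        self.burst = burst      # maximum stored tokens (assumed parameter)
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # timestamp of the previous call, in seconds

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True         # this reclamation event is allowed
        return False            # over the rate bound: defer reclamation


class OnlineRequest:
    """Hypothetical per-request state enforcing at most one compute preemption."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.preempted_offline = False

    def maybe_preempt(self):
        # Only the first call for a given request triggers a preemption.
        if self.preempted_offline:
            return False
        self.preempted_offline = True
        return True
```

In this sketch a request that calls `maybe_preempt()` repeatedly still causes a single preemption, and a burst of reclamation attempts beyond the bucket's capacity is deferred rather than executed, which is the sense in which both rates stay bounded.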
