Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

Fangyue Liu; Hua Liu; Xinyuan Lyu; Shuo Ai; Hao Liang; Lingpeng Chen; Ziqian Hu; Chong Zha; Xin Jin; Hanmei Luo; Peng Chen

arXiv:2604.07874 · cs.OS · April 10, 2026

TL;DR

Valve is a GPU runtime system that enables efficient online-offline inference colocation with bounded preemption latency and rate, significantly improving resource utilization in production environments.

Contribution

Valve introduces a practical GPU runtime that guarantees bounded preemption latency and rate while requiring minimal framework and driver modifications, enabling high utilization in production.

Findings

01

Improves cluster utilization by 34.6% in production.

02

Achieves sub-millisecond preemption with minimal online interference.

03

Reduces GPU resource requirements by 2,170 GPUs.

Abstract

LLM inference now powers latency-critical production services. The bursty nature of inference traffic leads to over-provisioning, which in turn leaves resources underutilized. While online-offline colocation promises to put idle capacity to use, broad production deployment must overcome two major challenges: (i) large online interference caused by slow or frequent preemptions, and (ii) the extensive framework and driver modifications needed to colocate different models and support preemption. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and…
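To make the idea of jointly bounding preemption latency and rate concrete, here is a minimal toy sketch in Python. It is not Valve's runtime and does not model channel-controlled compute isolation or page-fault-free reclamation; every name in it (`OnlineRequest`, `ComputePreemptionGate`, `ReclaimRateLimiter`, `worst_case_delay_ms`) and every numeric value is a hypothetical chosen for illustration. It only shows why capping both the number of preemptions per online request and the latency of each one yields a bounded worst-case delay, and how a token bucket could rate-limit reclamation events.

```python
from dataclasses import dataclass


@dataclass
class OnlineRequest:
    """Hypothetical online request; tracks compute preemptions charged to it."""
    request_id: int
    preemptions: int = 0


class ComputePreemptionGate:
    """Toy gate: permit at most `max_per_request` compute preemptions per request."""

    def __init__(self, max_per_request: int = 1):
        self.max_per_request = max_per_request

    def try_preempt(self, req: OnlineRequest) -> bool:
        # Refuse once the request has used up its preemption budget.
        if req.preemptions >= self.max_per_request:
            return False
        req.preemptions += 1
        return True


class ReclaimRateLimiter:
    """Toy token bucket limiting how often memory reclamation may fire."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s          # tokens refilled per second
        self.capacity = float(burst)    # maximum stored tokens
        self.tokens = float(burst)      # start with a full bucket
        self.last = 0.0                 # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def worst_case_delay_ms(max_per_request: int, preempt_latency_ms: float) -> float:
    """Upper bound on compute-preemption delay seen by one online request."""
    return max_per_request * preempt_latency_ms
```

With one preemption allowed per request and an assumed per-preemption latency of 0.5 ms, `worst_case_delay_ms(1, 0.5)` gives 0.5 ms of added delay per request. Bounding only the latency or only the rate would leave the product unbounded, which is the intuition behind bounding both together.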
