WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

Chiheng Lou; Sheng Qi; Rui Kang; Yong Zhang; Chen Sun; Pengcheng Wang; Xuanzhe Liu; Xin Jin

arXiv:2512.09472 · cs.DC · May 22, 2026

TL;DR

WarmServe introduces a proactive GPU prewarming approach for multi-LLM serving, leveraging workload predictability to significantly reduce latency and improve throughput in shared GPU environments.

Contribution

The paper presents WarmServe, a novel system with algorithms for workload-aware prewarming, GPU memory management, and interference minimization, enabling efficient multi-LLM deployment.

Findings

01 Reduces tail TTFT by up to 50.8 times compared to autoscaling systems.

02 Supports up to 2.5 times higher request throughput than GPU-sharing systems.

03 Leverages workload periodicity for proactive model preloading.

Abstract

Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model…
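The idea of prewarming from workload forecasts can be illustrated with a toy sketch. This is not WarmServe's algorithm, which the truncated abstract does not specify. It assumes a seasonal-naive forecast (the next slot repeats the value from one period earlier) and a greedy choice of which models' weights to keep on a single GPU. All names here (`Model`, `seasonal_naive_forecast`, `pick_prewarm_set`) and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    weight_gb: float       # GPU memory needed to hold the weights (assumed known)
    history: list[float]   # past request rates, one sample per time slot


def seasonal_naive_forecast(history: list[float], period: int) -> float:
    """Predict the next slot as the value observed one period earlier."""
    if not history:
        return 0.0
    if len(history) < period:
        return history[-1]  # not enough data for a full period: fall back to last value
    return history[-period]


def pick_prewarm_set(models: list[Model], free_gb: float, period: int) -> list[str]:
    """Greedily pick models to prewarm on one GPU, ranked by forecast load per GB."""
    ranked = sorted(
        models,
        key=lambda m: seasonal_naive_forecast(m.history, period) / m.weight_gb,
        reverse=True,
    )
    chosen = []
    for m in ranked:
        expected = seasonal_naive_forecast(m.history, period)
        # Skip models with no expected traffic or that no longer fit.
        if expected > 0 and m.weight_gb <= free_gb:
            chosen.append(m.name)
            free_gb -= m.weight_gb
    return chosen


# Hypothetical workload with a period of two slots.
models = [
    Model("A", 16.0, [10, 0, 10, 0]),  # busy again in the next slot
    Model("B", 8.0, [0, 4, 0, 4]),     # idle in the next slot
    Model("C", 14.0, [6, 6, 6, 6]),    # steady traffic
]
prewarmed = pick_prewarm_set(models, free_gb=32.0, period=2)
```

In this sketch, one GPU holds weights for several models at once, so whichever model sees a burst can start serving without first loading its weights. A real system would also need to handle eviction, multi-GPU placement, and interference with running instances, none of which is modeled here.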

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Natural Language Processing Techniques