WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Xuanzhe Liu, Xin Jin

TL;DR
WarmServe introduces a proactive GPU prewarming approach for multi-LLM serving, leveraging workload predictability to significantly reduce latency and improve throughput in shared GPU environments.
Contribution
The paper presents WarmServe, a novel system with algorithms for workload-aware prewarming, GPU memory management, and interference minimization, enabling efficient multi-LLM deployment.
Findings
Improves tail TTFT by up to 50.8× over autoscaling-based systems.
Serves up to 2.5× more requests than GPU-sharing systems.
Leverages workload periodicity for proactive model preloading.
Abstract
Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model…
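The forecast-driven prewarming idea described above can be illustrated with a small sketch. This is not WarmServe's algorithm: the seasonal-average forecast, the greedy "predicted surge per GB" ranking, and every name here (`Model`, `seasonal_forecast`, `pick_prewarm_set`, the sample models and numbers) are assumptions chosen only to show how periodic workload history could decide which models' weights to preload onto one GPU with limited free memory.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    weight_gb: float       # GPU memory needed for the weights (assumed > 0)
    history: list[float]   # past request rates, one value per time slot


def seasonal_forecast(history: list[float], period: int) -> float:
    """Predict the next slot as the mean of past slots at the same phase."""
    if period <= 0:
        raise ValueError("period must be positive")
    n = len(history)
    # The next slot has index n; same-phase slots are n - period, n - 2*period, ...
    same_phase = [history[i] for i in range(n - period, -1, -period)]
    if not same_phase:
        return 0.0  # not enough history to cover one full period
    return sum(same_phase) / len(same_phase)


def pick_prewarm_set(models: list[Model], period: int, free_gb: float) -> list[str]:
    """Greedily choose models to prewarm on one GPU (hypothetical policy)."""
    scored = []
    for m in models:
        predicted = seasonal_forecast(m.history, period)
        current = m.history[-1] if m.history else 0.0
        surge = predicted - current  # expected increase in load
        if surge > 0:
            # Rank by expected surge per GB of weights held in memory.
            scored.append((surge / m.weight_gb, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen = []
    for _, m in scored:
        if m.weight_gb <= free_gb:
            chosen.append(m.name)
            free_gb -= m.weight_gb
    return chosen


# Illustrative data only: four made-up models with a period of 2 slots.
models = [
    Model("A", 16.0, [2, 10, 2, 10, 2]),
    Model("B", 28.0, [5, 5, 5, 5, 5]),
    Model("C", 8.0, [1, 3, 1, 3, 1]),
    Model("D", 40.0, [0, 30, 0, 30, 0]),
]
chosen = pick_prewarm_set(models, period=2, free_gb=48.0)
print(chosen)
```

In this toy input, one GPU's free memory ends up holding weights for more than one model, which is the "one-for-many" aspect; a real system would also have to account for eviction, loading cost, and interference with running instances, none of which this sketch models.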
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Natural Language Processing Techniques
