OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, Eiko Yoneki

TL;DR
OServe is a system designed to improve large language model serving by dynamically adapting to spatial and temporal workload heterogeneity, significantly enhancing performance.
Contribution
It introduces a workload-aware scheduling algorithm and an adaptive switching method for flexible model deployment based on real-time workload changes.
Findings
OServe achieves up to 2× performance improvement over existing systems.
It effectively adapts to workload variations, maintaining high efficiency.
Experimental results on real-world traces validate its advantages.
Abstract
Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
