Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control
Ruihan Lin, Zezhen Ding, Zean Han, Jiheng Zhang

TL;DR
This paper presents a stochastic control framework for efficiently scheduling heterogeneous large language model inference workloads on GPU clusters, addressing contention between prefill and decode phases to optimize performance.
Contribution
It introduces a novel queueing network model and asymptotically optimal control policies for managing LLM inference workloads with workload heterogeneity and resource contention.
Findings
The proposed policies outperform standard heuristics in simulations.
The unified framework incorporates latency and fairness constraints.
The scheduling policies are proved asymptotically optimal in many-GPU regimes.
Abstract
Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive prefill phase that processes user input, followed by a memory-bound decode phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates,…
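The state-dependent contention described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the `GpuState` class, the function names, the all-or-nothing throttling form, and the `base_rate` and `throttle` values are hypothetical, chosen only to show what "decode service rate depends on whether a prefill is running" might look like in code.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    """Hypothetical per-GPU state: counts of tasks in each phase."""
    prefills: int  # prefill tasks currently running on this GPU
    decodes: int   # decode tasks currently running on this GPU


def decode_rate(state: GpuState, base_rate: float = 1.0, throttle: float = 0.5) -> float:
    """Per-decode service rate under an assumed throttling rule.

    Assumption (not from the paper): each decode runs at base_rate when no
    prefill shares the GPU, and at base_rate * throttle when one or more do.
    """
    if state.decodes == 0:
        return 0.0
    return base_rate * (throttle if state.prefills > 0 else 1.0)


def total_decode_throughput(states: list[GpuState]) -> float:
    """Aggregate decode completion rate across the cluster."""
    return sum(s.decodes * decode_rate(s) for s in states)


# Two GPUs with four decodes each; only the second also runs a prefill.
cluster = [GpuState(prefills=0, decodes=4), GpuState(prefills=1, decodes=4)]
throughput = total_decode_throughput(cluster)
```

In this toy setting the GPU that co-locates a prefill contributes half the decode throughput of the other, which is the kind of trade-off a scheduling policy has to weigh when deciding where to place prefill work.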
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
