Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control

Ruihan Lin; Zezhen Ding; Zean Han; Jiheng Zhang

arXiv:2602.02987·cs.DC·February 4, 2026

Open Access

TL;DR

This paper presents a stochastic control framework for efficiently scheduling heterogeneous large language model inference workloads on GPU clusters, addressing contention between prefill and decode phases to optimize performance.

Contribution

The paper introduces a queueing network model and asymptotically optimal control policies for scheduling LLM inference under workload heterogeneity and prefill-decode resource contention.

Findings

01. Proposed policies outperform standard heuristics in simulations.

02. Developed a unified framework incorporating latency and fairness constraints.

03. Proved asymptotic optimality of scheduling policies in many-GPU regimes.

Abstract

Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive prefill phase that processes user input, followed by a memory-bound decode phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates,…

Taxonomy

Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · IoT and Edge/Fog Computing