LLM Serving Optimization with Variable Prefill and Decode Lengths

Meixuan Wang; Yinyu Ye; Zijie Zhou

arXiv:2508.06133 · math.OC · February 11, 2026

Open Access

TL;DR

This paper studies offline scheduling of LLM serving requests with variable prompt (prefill) and decode lengths under a fixed KV-cache memory budget. It proposes a new algorithm, Sorted-F, with a constant-factor guarantee on total latency, along with practical variants.

Contribution

It introduces Sorted-F, a novel batch scheduling algorithm with a provable constant-factor guarantee on total end-to-end latency, tailored for heterogeneous LLM request workloads under a KV-cache memory limit.

Findings

01. Sorted-F achieves constant-factor latency guarantees.
02. Practical variants outperform standard heuristics in experiments.
03. Scheduling strategies significantly impact latency during peak workloads.

Abstract

We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arriving together, we schedule mixed prefill and decode batches to minimize total end-to-end latency. We show that heterogeneity in prompt lengths makes the problem computationally intractable and that widely used heuristics such as first-come-first-served and shortest-first can be arbitrarily suboptimal. We propose Sorted-F, which repeatedly forms feasible batches using a new selection metric that balances batch size against downstream decode cost, and prove it achieves a constant-factor guarantee on total latency. We further develop practical variants -- an exact solver for…
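
The memory model in the abstract (prompt tokens set the initial KV usage, and each generated token adds one unit) can be illustrated with a minimal Python sketch. This is not the paper's Sorted-F algorithm or its selection metric. The names `Request`, `kv_usage`, `batch_is_feasible`, and `batch_total_latency` are hypothetical, and the sketch assumes that all requests in a batch start together, that each decode step produces one token per active request, and that a request's memory is released as soon as it finishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_len: int  # prefill tokens: initial KV-cache usage
    decode_len: int  # tokens to generate: one extra memory unit each


def kv_usage(req: Request, tokens_generated: int) -> int:
    """KV-cache units held by a request after generating some tokens."""
    return req.prompt_len + tokens_generated


def batch_is_feasible(batch: list[Request], budget: int) -> bool:
    """Check that a batch started together never exceeds the memory budget.

    Assumption: a finished request frees its memory immediately, so only
    requests with decode_len >= step are counted at a given step.
    """
    horizon = max((r.decode_len for r in batch), default=0)
    for step in range(0, horizon + 1):
        active = [r for r in batch if r.decode_len >= step]
        if sum(kv_usage(r, step) for r in active) > budget:
            return False
    return True


def batch_total_latency(batch: list[Request], start_time: int = 0) -> int:
    """Sum of completion times, with every request arriving at time 0."""
    return sum(start_time + r.decode_len for r in batch)


batch = [Request(prompt_len=4, decode_len=2), Request(prompt_len=3, decode_len=5)]
print(batch_is_feasible(batch, budget=11))
print(batch_is_feasible(batch, budget=10))
print(batch_total_latency(batch))
```

Under these assumptions the peak usage of the example batch occurs at the second decode step, when both requests are still active, which is why the check must scan every step rather than only the start or the end of the batch.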

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management