LLM Serving Optimization with Variable Prefill and Decode Lengths
Meixuan Wang, Yinyu Ye, Zijie Zhou

TL;DR
This paper studies offline scheduling of LLM serving requests with heterogeneous prompt and decode lengths under a fixed KV-cache memory budget. It proposes Sorted-F, a batch scheduling algorithm with a constant-factor guarantee on total latency, along with practical variants that outperform standard heuristics in experiments.
Contribution
It introduces Sorted-F, a novel batch scheduling algorithm with a provable constant-factor bound on total latency, tailored to heterogeneous LLM request workloads under a KV-cache memory limit.
Findings
Sorted-F achieves constant-factor latency guarantees.
Practical variants outperform standard heuristics in experiments.
Scheduling strategies significantly impact latency during peak workloads.
Abstract
We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arriving together, we schedule mixed prefill and decode batches to minimize total end-to-end latency. We show that heterogeneity in prompt lengths makes the problem computationally intractable and that widely used heuristics such as first-come-first-served and shortest-first can be arbitrarily suboptimal. We propose Sorted-F, which repeatedly forms feasible batches using a new selection metric that balances batch size against downstream decode cost, and prove it achieves a constant-factor guarantee on total latency. We further develop practical variants -- an exact solver for…
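The memory model in the abstract can be illustrated with a minimal Python sketch. This is not Sorted-F or any method from the paper: the names `Request`, `kv_usage`, and `total_latency` are hypothetical, and the sketch assumes that every running request generates one token per time step, that a request is admitted only if its peak usage (prompt plus full decode length) fits alongside the peaks already reserved, and that all requests arrive at time 0 so latency equals completion time. It only shows why the processing order matters under a fixed KV-cache budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_len: int  # prefill tokens: initial KV-cache usage
    decode_len: int  # tokens to generate (assumed >= 1)


def kv_usage(req: Request, tokens_generated: int) -> int:
    """KV usage of one request: prompt tokens plus one unit per generated token."""
    return req.prompt_len + tokens_generated


def total_latency(order: list[Request], budget: int) -> int:
    """Sum of completion times when requests are admitted in the given order.

    Assumption (not from the paper): admission reserves each request's peak
    usage, so the budget can never be exceeded while decoding.
    """
    pending = list(order)
    running = []  # each entry is [request, tokens_generated]
    clock = 0
    total = 0
    while pending or running:
        reserved = sum(kv_usage(r, r.decode_len) for r, _ in running)
        # Admit strictly in order; stop at the first request that does not fit.
        while pending and reserved + kv_usage(pending[0], pending[0].decode_len) <= budget:
            req = pending.pop(0)
            reserved += kv_usage(req, req.decode_len)
            running.append([req, 0])
        if not running:
            raise ValueError("a request's peak usage exceeds the memory budget")
        clock += 1
        for slot in running:
            slot[1] += 1  # one decode step for every running request
        finished = [s for s in running if s[1] == s[0].decode_len]
        total += clock * len(finished)
        running = [s for s in running if s[1] < s[0].decode_len]
    return total


long_req = Request(prompt_len=2, decode_len=4)
short_a = Request(prompt_len=1, decode_len=1)
short_b = Request(prompt_len=1, decode_len=1)

arrival_order = total_latency([long_req, short_a, short_b], budget=6)
short_first = total_latency([short_a, short_b, long_req], budget=6)
print(arrival_order, short_first)
```

In this toy instance, serving the long request first blocks both short ones until it finishes, while serving the short ones first lets them complete in a single step. The abstract notes that neither first-come-first-served nor shortest-first is reliable in general, so the example should be read only as an illustration of the memory constraint, not as evidence for either heuristic.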
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management
