Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
Zhuolun Dong, Junyu Cao

TL;DR
This paper introduces a flow-control scheduling framework for LLM inference that guarantees system stability and improves throughput and latency by controlling the rate at which prompts join the active set.
Contribution
It proposes a novel flow-control algorithm with provable stability guarantees tailored for large language model inference systems.
Findings
Achieves higher token and request throughput than commonly used scheduling strategies.
Reduces average and tail latency in LLM inference.
Ensures more stable KV cache utilization under load.
Abstract
Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput,…
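The mechanism the abstract describes, limiting how fast prompts join the active set while per-request memory grows with every generated token, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the fixed per-step admission cap, the evict-newest overflow policy, and the names `Request`, `simulate`, `capacity`, and `admit_per_step` are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    decode_len: int  # hidden from the scheduler; only used to end the request
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # KV cache footprint grows by one token per decode step.
        return self.prompt_len + self.generated


def simulate(requests, capacity, admit_per_step, max_steps=10_000):
    """Toy scheduler (assumed model). Returns (completed, overflows, peak_usage)."""
    waiting = deque(Request(p, d) for p, d in requests)
    active = []
    completed = overflows = peak = 0
    for _ in range(max_steps):
        if not waiting and not active:
            break
        # Flow control: admit at most `admit_per_step` prompts per step,
        # and only while the next prompt fits in the cache right now.
        admitted = 0
        while waiting and admitted < admit_per_step:
            used = sum(r.kv_tokens for r in active)
            if used + waiting[0].prompt_len > capacity:
                break
            active.append(waiting.popleft())
            admitted += 1
        # Decode step: every active request generates one token.
        for r in active:
            r.generated += 1
        # Overflow handling (assumption): evict the newest request,
        # discard its progress, and put it back at the head of the queue.
        while sum(r.kv_tokens for r in active) > capacity:
            victim = active.pop()
            victim.generated = 0
            waiting.appendleft(victim)
            overflows += 1
        peak = max(peak, sum(r.kv_tokens for r in active))
        # Retire requests that reached their (previously unknown) decode length.
        completed += sum(r.generated >= r.decode_len for r in active)
        active = [r for r in active if r.generated < r.decode_len]
    return completed, overflows, peak


# Six identical requests: 4 prompt tokens, 8 decode tokens, cache of 60 tokens.
requests = [(4, 8)] * 6
greedy = simulate(requests, capacity=60, admit_per_step=6)  # admit everything at once
paced = simulate(requests, capacity=60, admit_per_step=1)   # one prompt per step
print("greedy:", greedy)
print("paced: ", paced)
```

In this toy setting, admitting all prompts at once lets the cache overflow once the requests have grown, whereas pacing admissions staggers their growth so the peak stays below capacity. The sketch only shows why admission rate matters when decode lengths are unknown; it says nothing about the paper's necessary or sufficient stability conditions.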
