Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

Zhuolun Dong; Junyu Cao

arXiv:2604.11001·cs.LG·April 14, 2026

TL;DR

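This paper introduces a flow-control scheduling framework for LLM inference that limits the rate at which prompts join the active set. The algorithm is provably stable under sufficient conditions, and in experiments it improves throughput and latency over commonly used strategies.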

Contribution

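It proposes a flow-control algorithm tailored for LLM inference systems, derives a necessary condition that any stable system must satisfy, and establishes sufficient conditions under which the algorithm provably achieves stability.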

Findings

01. Achieves higher token and request throughput compared to existing strategies.
02. Reduces average and tail latency in LLM inference.
03. Ensures more stable KV cache utilization under load.

Abstract

Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput,…
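The core idea in the abstract, limiting how fast prompts join the active set so that KV cache memory does not overflow as decoding proceeds, can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: the fixed per-step admission cap `admit_per_step`, the `Request` class, and the `simulate` function are all hypothetical names and simplifications chosen for illustration. It assumes each active request generates one token per step and that its KV footprint equals prompt tokens plus generated tokens.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int      # tokens in the prompt (known on arrival)
    decode_len: int      # tokens to generate (hidden from the scheduler)
    generated: int = 0

    def kv_tokens(self) -> int:
        # KV cache footprint grows by one token per decode step.
        return self.prompt_len + self.generated


def simulate(requests, admit_per_step, steps):
    """Run a toy decode loop; return (peak KV tokens, completed requests).

    admit_per_step is a hypothetical flow-control knob: the maximum number
    of waiting prompts allowed to join the active set in each step.
    """
    waiting = deque(Request(p, d) for p, d in requests)
    active = []
    peak = 0
    completed = 0
    for _step in range(steps):
        # Flow control: admit at most admit_per_step prompts this step.
        for _ in range(min(admit_per_step, len(waiting))):
            active.append(waiting.popleft())
        # Every active request decodes one token.
        for r in active:
            r.generated += 1
        peak = max(peak, sum(r.kv_tokens() for r in active))
        # Retire finished requests, which frees their KV cache.
        still_active = [r for r in active if r.generated < r.decode_len]
        completed += len(active) - len(still_active)
        active = still_active
    return peak, completed


# Six identical requests: 4 prompt tokens, 3 decode tokens each.
workload = [(4, 3)] * 6
print(simulate(workload, admit_per_step=6, steps=10))  # (42, 6)
print(simulate(workload, admit_per_step=1, steps=10))  # (18, 6)
```

In this toy workload, admitting everything at once peaks at 42 KV tokens, while admitting one prompt per step completes the same six requests with a peak of 18. A real scheduler would need to choose the admission rate from arrival and decode-length statistics, which is the part the paper's stability conditions address.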
