Past-Future Scheduler for LLM Serving under SLA Guarantees
Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, Xianglong Liu

TL;DR
This paper introduces the Past-Future scheduler for LLM serving, which estimates the memory requirements of batched requests from the distribution of request lengths and thereby achieves significantly higher throughput under SLA constraints (goodput).
Contribution
The paper presents a novel Past-Future scheduler that estimates memory requirements more accurately for continuous batching in LLM serving, which reduces both request evictions and queuing time and improves throughput under SLA guarantees.
Findings
Achieves 2-3x higher goodput than existing schedulers.
Effectively balances request queuing and evictions across diverse scenarios.
Demonstrates superior performance in a new high-performance LLM serving framework.
Abstract
The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput), across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely…
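To make the overestimation and underestimation problem concrete, the following is a minimal Python sketch of three ways a continuous-batching scheduler might estimate the KV-cache tokens a batch will need. It is an illustration under stated assumptions, not the paper's algorithm: the abstract above is truncated before the method is described, so the `Request` fields, the function names, the quantile rule, and the one-token-per-step decoding model are all hypothetical. The "conservative" estimate reserves every request's full output cap, the "aggressive" estimate counts only tokens already held, and the history-based estimate predicts output lengths from previously observed lengths and takes the peak over future decoding steps.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Request:
    # Hypothetical request record; field names are assumptions.
    prompt_len: int       # tokens in the prompt
    generated: int        # output tokens produced so far
    max_new_tokens: int   # user-specified cap on output length


def conservative_estimate(reqs: Sequence[Request]) -> int:
    """Reserve the full output cap for every request (tends to overestimate)."""
    return sum(r.prompt_len + r.max_new_tokens for r in reqs)


def aggressive_estimate(reqs: Sequence[Request]) -> int:
    """Count only tokens held right now (tends to underestimate)."""
    return sum(r.prompt_len + r.generated for r in reqs)


def predict_output_len(req: Request, past_output_lens: Sequence[int],
                       quantile: float = 0.9) -> int:
    """Assumed rule: a quantile of past output lengths that exceed the
    tokens already generated, capped by the request's own limit."""
    longer: List[int] = sorted(n for n in past_output_lens if n > req.generated)
    if not longer:
        # History gives no guidance; fall back to the request's cap.
        return req.max_new_tokens
    idx = min(len(longer) - 1, int(quantile * len(longer)))
    return min(longer[idx], req.max_new_tokens)


def history_peak_estimate(reqs: Sequence[Request],
                          past_output_lens: Sequence[int],
                          quantile: float = 0.9) -> int:
    """Peak token usage over future steps, assuming each running request
    emits one token per step and frees its memory after its last token."""
    current = [r.prompt_len + r.generated for r in reqs]
    remaining = [
        max(0, predict_output_len(r, past_output_lens, quantile) - r.generated)
        for r in reqs
    ]
    peak = sum(current)  # usage at the present step
    # The peak can only occur at a step where some request finishes.
    for t in set(remaining):
        usage = sum(c + t for c, rem in zip(current, remaining) if rem >= t)
        peak = max(peak, usage)
    return peak


if __name__ == "__main__":
    batch = [Request(100, 10, 500), Request(50, 0, 200)]
    history = [20, 40, 60, 80, 100]  # made-up past output lengths
    print(conservative_estimate(batch))
    print(aggressive_estimate(batch))
    print(history_peak_estimate(batch, history))
```

A scheduler would compare such an estimate against the free KV-cache capacity before admitting a new request. In this toy setup the conservative figure is far above the history-based one (more queuing), while the aggressive figure is far below it (more risk of evictions), which is the trade-off the abstract describes.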
