Past-Future Scheduler for LLM Serving under SLA Guarantees

Ruihao Gong; Shihao Bai; Siyu Wu; Yunqian Fan; Zaijun Wang; Xiuhong Li; Hailong Yang; Xianglong Liu

arXiv:2507.10150·cs.DC·July 15, 2025

TL;DR

This paper introduces the Past-Future scheduler for LLM serving, which estimates the memory needs of requests from request length distributions and thereby raises throughput under SLA constraints (goodput).

Contribution

The paper presents a novel Past-Future scheduler that improves memory estimation accuracy for LLM batching, enabling better throughput and SLA adherence.

Findings

01 Achieves 2-3x higher goodput than existing schedulers.

02 Effectively balances request queuing and evictions across diverse scenarios.

03 Demonstrates superior performance in a new high-performance LLM serving framework.

Abstract

The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput), across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely…
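
The trade-off the abstract describes, between overestimating and underestimating memory when admitting requests into a batch, can be illustrated with a small sketch. This is not the paper's algorithm, since the abstract is cut off before the method is described. It is a toy comparison under stated assumptions: memory is counted in KV-cache tokens, and every name here (`Request`, `conservative_estimate`, `aggressive_estimate`, `history_estimate`, `admit`) is hypothetical. The history-based estimator is a generic stand-in that picks a percentile of previously observed output lengths.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    prompt_len: int       # tokens already in the prompt
    generated: int        # tokens generated so far
    max_new_tokens: int   # hard cap on output length


def conservative_estimate(req: Request) -> int:
    # Reserve for the worst case: tends to overestimate, so requests queue.
    return req.prompt_len + req.max_new_tokens


def aggressive_estimate(req: Request) -> int:
    # Reserve only what is used now: tends to underestimate, risking evictions.
    return req.prompt_len + req.generated


def history_estimate(req: Request, past_output_lens: List[int],
                     percentile: int = 90) -> int:
    # Assumed stand-in: predict the final output length as a percentile of
    # historical output lengths that exceed what was already generated.
    candidates = sorted(n for n in past_output_lens if n > req.generated)
    if not candidates:
        predicted = req.max_new_tokens
    else:
        idx = min(len(candidates) - 1, len(candidates) * percentile // 100)
        predicted = min(candidates[idx], req.max_new_tokens)
    return req.prompt_len + predicted


def admit(queue: List[Request], capacity: int,
          estimate: Callable[[Request], int]) -> List[Request]:
    # Admit requests in FIFO order until the estimated memory would overflow.
    admitted, used = [], 0
    for req in queue:
        need = estimate(req)
        if used + need > capacity:
            break
        admitted.append(req)
        used += need
    return admitted


queue = [Request(prompt_len=100, generated=0, max_new_tokens=900) for _ in range(4)]
history = [50, 60, 70, 80, 100, 120, 150, 200, 300, 400]
capacity = 1800  # token slots available in the KV cache (made-up number)

n_conservative = len(admit(queue, capacity, conservative_estimate))
n_aggressive = len(admit(queue, capacity, aggressive_estimate))
n_history = len(admit(queue, capacity, lambda r: history_estimate(r, history)))
```

With these made-up numbers, the conservative estimator admits the fewest requests, the aggressive one admits all of them without reserving room for their future output, and the history-based one falls in between.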
