Multi-stage Flow Scheduling for LLM Serving
Yijun Sun (1), Xudong Liao (1), Songrun Xie (1), Hao Chen (2), Han Tian (3), Wenxue Li (1), Yiming Zhang (2), Kai Chen (1) ((1) Hong Kong University of Science and Technology, (2) Shanghai Jiao Tong University, (3) University of Science and Technology of China)

TL;DR
This paper introduces MFS, a multi-stage flow scheduling mechanism for LLM serving that improves Time-To-First-Token (TTFT) SLO attainment over existing methods by dynamically prioritizing network flows based on their laxity.
Contribution
MFS is a stage-aware flow scheduling approach that approximates the Least-Laxity-First (LLF) policy without precise slack knowledge, with the goal of maximizing TTFT SLO attainment in LLM serving.
Findings
MFS improves TTFT SLO attainment by 1.2x to 2.4x.
MFS effectively manages complex multi-stage workflows in LLM serving.
Evaluation on real and simulated systems confirms MFS's superior performance.
Abstract
Meeting stringent Time-To-First-Token (TTFT) requirements is crucial for LLM applications. To improve efficiency, modern LLM serving systems adopt disaggregated architectures with diverse parallelisms, introducing complex multi-stage workflows involving reusable KV-block retrieval, collective communication, and P2D transfer. Flows from dependent stages overlap within and across requests on shared bottleneck links, making TTFT highly susceptible to network contention and necessitating stage-aware scheduling. Unfortunately, most existing works schedule flows in a stage-agnostic manner, leading to uncoordinated contention that constitutes a primary cause of SLO violations. In this paper, we present MFS, a holistic multi-stage flow scheduling mechanism designed to maximize TTFT SLO attainment. At its core, MFS approximates the Least-Laxity-First (LLF) scheduling policy without requiring…
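As a point of reference for the LLF policy named in the abstract, the sketch below shows textbook Least-Laxity-First ordering applied to network flows. It is not the MFS mechanism, which the abstract describes only as an approximation of LLF. The `Flow` class, the field names, the single shared link rate, and the example deadlines and sizes are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    # Hypothetical flow record; field names are illustrative, not from the paper.
    name: str
    remaining_bytes: float  # bytes still to be sent
    deadline_s: float       # absolute deadline, in seconds


def laxity(flow: Flow, now_s: float, link_bytes_per_s: float) -> float:
    """Slack left if the flow had the whole link: time to deadline minus transfer time."""
    time_to_deadline = flow.deadline_s - now_s
    transfer_time = flow.remaining_bytes / link_bytes_per_s
    return time_to_deadline - transfer_time


def llf_order(flows: list[Flow], now_s: float, link_bytes_per_s: float) -> list[Flow]:
    """Return flows sorted so the one with the least laxity is served first."""
    return sorted(flows, key=lambda f: laxity(f, now_s, link_bytes_per_s))


# Illustrative flows named after the stages listed in the abstract (values are made up).
flows = [
    Flow("kv_retrieval", remaining_bytes=4e9, deadline_s=1.0),
    Flow("collective", remaining_bytes=1e9, deadline_s=0.3),
    Flow("p2d_transfer", remaining_bytes=2e9, deadline_s=0.5),
]
order = [f.name for f in llf_order(flows, now_s=0.0, link_bytes_per_s=10e9)]
print(order)
```

Exact LLF needs each flow's deadline and remaining size up front, which is the precise slack knowledge that the summary above says MFS works without.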
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Advanced Data Storage Technologies
