Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
Qipeng Wang, Zhendong Yang

TL;DR
This paper introduces Kairos, an SLO-aware scheduling system for disaggregated LLM inference that improves service-level adherence and throughput by dynamically prioritizing requests and adaptively batching decode requests.
Contribution
Kairos is a novel scheduling system that combines urgency-based prefill prioritization and slack-guided batching to optimize LLM inference performance.
Findings
Kairos improves time-to-first-token (TTFT) SLO attainment by up to 23.9%.
Kairos increases time-per-output-token (TPOT) SLO attainment by up to 27.1%.
Kairos enhances end-to-end SLO attainment by up to 33.8%.
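The findings above are stated as SLO attainment, which this page does not define. A minimal sketch follows, assuming the common reading that attainment is the fraction of requests whose latency metric stays within its target. The record fields, the threshold values, and the "both" entry (one possible reading of end-to-end attainment) are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    ttft: float          # seconds from arrival to first output token
    decode_time: float   # seconds from first output token to last
    output_tokens: int   # number of generated tokens


def tpot(c: Completed) -> float:
    # Average time per output token after the first one.
    return c.decode_time / max(c.output_tokens - 1, 1)


def attainment(records, ttft_slo: float, tpot_slo: float) -> dict:
    """Fraction of requests meeting each target (assumed definition)."""
    n = len(records)
    ttft_ok = [c.ttft <= ttft_slo for c in records]
    tpot_ok = [tpot(c) <= tpot_slo for c in records]
    return {
        "ttft": sum(ttft_ok) / n,
        "tpot": sum(tpot_ok) / n,
        # Assumption: a request counts end-to-end only if it meets both targets.
        "both": sum(a and b for a, b in zip(ttft_ok, tpot_ok)) / n,
    }


if __name__ == "__main__":
    records = [
        Completed(ttft=0.2, decode_time=1.0, output_tokens=11),
        Completed(ttft=0.8, decode_time=0.5, output_tokens=11),
        Completed(ttft=0.3, decode_time=3.0, output_tokens=11),
        Completed(ttft=1.5, decode_time=4.0, output_tokens=11),
    ]
    print(attainment(records, ttft_slo=0.5, tpot_slo=0.15))
```

Under this reading, the joint figure can never exceed either per-metric figure, since a request must satisfy both targets to count.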
Abstract
In production environments, large language model (LLM) serving is required to meet stringent service-level objectives (SLOs) amid highly variable request patterns. In practice, request lengths follow a long-tail distribution, which gives rise to head-of-line blocking on the prefill side and underutilization caused by stragglers on the decode side in disaggregated serving architectures. Current systems, which adopt first-come-first-served (FCFS) scheduling for prefill and continuous batching for decode, lack the ability to adapt to this imbalance, resulting in compromised SLO attainment and reduced throughput. To address these challenges, we propose Kairos, an SLO-aware scheduling system equipped with two complementary mechanisms. On the prefill side, Kairos employs urgency-based priority scheduling: it predicts prefill completion times and dynamically selects requests to maximize the…
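To illustrate the head-of-line blocking the abstract describes on the prefill side, here is a minimal single-worker simulation that compares FCFS with a deadline-aware policy. It is a sketch under stated assumptions, not Kairos itself: the linear prefill-time predictor, its constants, and the least-slack-first rule are stand-ins, since the abstract is truncated before it states the actual selection objective.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float       # seconds
    prompt_tokens: int
    ttft_slo: float      # seconds allowed from arrival to first token


def predict_prefill_time(prompt_tokens: int,
                         per_token: float = 0.001,
                         overhead: float = 0.01) -> float:
    # Hypothetical predictor: prefill cost grows linearly with prompt length.
    return overhead + per_token * prompt_tokens


def fcfs(queue, now):
    # Baseline: serve the earliest arrival first.
    return min(queue, key=lambda r: r.arrival)


def least_slack_first(queue, now):
    # Slack = deadline minus predicted completion time if started now.
    def slack(r):
        return (r.arrival + r.ttft_slo) - (now + predict_prefill_time(r.prompt_tokens))

    # Prefer requests that can still meet their deadline; among them,
    # pick the most urgent. Fall back to the whole queue if none can.
    feasible = [r for r in queue if slack(r) >= 0]
    return min(feasible or queue, key=slack)


def simulate(requests, policy):
    """Run one prefill worker; return the ids of requests that met their TTFT SLO."""
    pending = sorted(requests, key=lambda r: r.arrival)
    queue, met, now, i = [], set(), 0.0, 0
    while i < len(pending) or queue:
        # Admit everything that has arrived by the current time.
        while i < len(pending) and pending[i].arrival <= now:
            queue.append(pending[i])
            i += 1
        if not queue:
            now = pending[i].arrival  # idle until the next arrival
            continue
        req = policy(queue, now)
        queue.remove(req)
        now += predict_prefill_time(req.prompt_tokens)
        if now - req.arrival <= req.ttft_slo:
            met.add(req.rid)
    return met


if __name__ == "__main__":
    reqs = [Request("long", 0.0, 8000, 20.0)] + [
        Request(f"s{k}", 0.0, 100, 1.0) for k in (1, 2, 3)
    ]
    print("FCFS met:", sorted(simulate(reqs, fcfs)))
    print("Least-slack met:", sorted(simulate(reqs, least_slack_first)))
```

With these toy numbers, FCFS runs the long prompt first and the three short requests miss their deadlines, while the slack-based policy serves the short requests first and the long one still finishes within its looser target.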
