PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
Zeyu Zhang, Haiying Shen

TL;DR
PecSched is a novel cluster scheduling system for LLM inference that efficiently handles input length heterogeneity through preemptive scheduling, significantly reducing delays for short requests while maintaining long request performance.
Contribution
PecSched introduces preemptive scheduling, supported by coordinated techniques and fast sequence parallelism, to improve the efficiency and fairness of LLM inference clusters.
Findings
Reduces 99th percentile queue delay for short inputs by up to 92%
Increases throughput of short requests by up to 595%
Maintains comparable job completion times for long inputs
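To make the "up to" percentages above easier to read, the short sketch below converts a signed percent change into a multiplier of the baseline. The helper name `relative_to_baseline` is illustrative and does not come from the paper; the only inputs are the two figures quoted in the findings.

```python
def relative_to_baseline(percent_change: float) -> float:
    """Convert a signed percent change into a multiplier of the baseline value."""
    return 1.0 + percent_change / 100.0


# A 92% reduction leaves roughly 8% of the baseline queue delay.
delay_factor = relative_to_baseline(-92)

# A 595% increase means roughly 6.95 times the baseline throughput.
throughput_factor = relative_to_baseline(595)

print(f"p99 queue delay: {delay_factor:.2f}x baseline")
print(f"short-request throughput: {throughput_factor:.2f}x baseline")
```

Both numbers are best cases ("up to"), so they bound the reported improvement rather than describe a typical run.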
Abstract
The scaling of transformer-based Large Language Models (LLMs) has significantly expanded their context lengths, enabling applications where inputs exceed 100K tokens. Our analysis of a recent Azure LLM inference trace reveals a highly skewed long-tail distribution of input lengths, with approximately 80% of inputs shorter than 2K tokens. Long inputs constitute only a small fraction. Existing cluster-level LLM scheduling strategies, including First-In-First-Out (FIFO), reservation-based, and priority-based approaches, primarily target short-input requests with lengths below 2K and fail to address this heterogeneity, leading to inefficiencies such as head-of-line blocking, resource underutilization, and starvation of long-input requests. We propose PecSched, a Preemptive and Efficient Cluster SCHEDuling system for LLM inference. PecSched introduces the following key techniques: 1)…
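The head-of-line blocking mentioned in the abstract can be illustrated with a toy single-worker simulation. This is a minimal sketch under simplifying assumptions, not PecSched's algorithm: the job tuples, the unit time steps, and the shortest-remaining-time policy are all chosen here for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (name, arrival_time, service_time), all in abstract time units (assumed).
Job = Tuple[str, int, int]


def fifo_queue_delays(jobs: List[Job]) -> Dict[str, int]:
    """Non-preemptive FIFO on one worker; delay = start time - arrival time."""
    clock = 0
    delays: Dict[str, int] = {}
    for name, arrival, service in sorted(jobs, key=lambda j: j[1]):
        start = max(clock, arrival)
        delays[name] = start - arrival
        clock = start + service
    return delays


def preemptive_short_first_delays(jobs: List[Job]) -> Dict[str, int]:
    """Preemptive shortest-remaining-time on one worker, in unit time steps.

    delay = time a job first receives service - arrival time.
    """
    remaining = {name: service for name, _, service in jobs}
    arrival = {name: arr for name, arr, _ in jobs}
    first_run: Dict[str, int] = {}
    clock = 0
    while any(r > 0 for r in remaining.values()):
        ready = [n for n in remaining if remaining[n] > 0 and arrival[n] <= clock]
        if ready:
            # Run the ready job with the least remaining work for one step.
            chosen = min(ready, key=lambda n: remaining[n])
            first_run.setdefault(chosen, clock)
            remaining[chosen] -= 1
        clock += 1
    return {n: first_run[n] - arrival[n] for n in first_run}


# One long request arrives first, followed by two short ones.
jobs = [("long", 0, 100), ("short1", 1, 2), ("short2", 2, 2)]
fifo = fifo_queue_delays(jobs)
preemptive = preemptive_short_first_delays(jobs)
print("FIFO queue delays:", fifo)
print("Preemptive queue delays:", preemptive)
```

Under FIFO the two short jobs wait behind the long one for about 100 time units, while the preemptive policy serves them almost immediately. A plain shortest-first policy like this one can starve the long job if short jobs keep arriving, which is the kind of problem the abstract attributes to existing priority-based approaches.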
Taxonomy
Topics: Neural Networks and Applications · Water Quality Monitoring Technologies · Blind Source Separation Techniques
