PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference

Zeyu Zhang; Haiying Shen

arXiv:2409.15104·cs.DC·June 10, 2025

TL;DR

PecSched is a cluster scheduling system for LLM inference that handles heterogeneous input lengths through preemptive scheduling, reducing queueing delay for short requests while keeping job completion times for long requests comparable.

Contribution

PecSched combines preemptive scheduling with coordinated supporting techniques and fast sequence parallelism to improve the efficiency and fairness of LLM inference clusters.

Findings

01 Reduces 99th percentile queue delay for short inputs by up to 92%

02 Increases throughput of short requests by up to 595%

03 Maintains comparable job completion times for long inputs
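To make the headline percentages easier to compare, the short sketch below converts them into multiplicative factors. It is plain arithmetic on the two numbers reported above; the helper names are hypothetical and chosen only for this illustration.

```python
def factor_from_percent_increase(pct: float) -> float:
    """An increase of pct percent means new = old * (1 + pct / 100)."""
    return 1.0 + pct / 100.0


def factor_from_percent_reduction(pct: float) -> float:
    """A reduction of pct percent means new = old * (1 - pct / 100)."""
    return 1.0 - pct / 100.0


# "Up to 595%" more throughput for short requests: about 6.95x the baseline.
throughput_factor = factor_from_percent_increase(595)

# "Up to 92%" lower p99 queue delay: about 0.08x the baseline delay.
delay_factor = factor_from_percent_reduction(92)

print(f"throughput: {throughput_factor:.2f}x, p99 queue delay: {delay_factor:.2f}x")
```

Both figures are reported as "up to" values, so these factors are best-case bounds rather than typical results.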

Abstract

The scaling of transformer-based Large Language Models (LLMs) has significantly expanded their context lengths, enabling applications where inputs exceed 100K tokens. Our analysis of a recent Azure LLM inference trace reveals a highly skewed long-tail distribution of input lengths, with approximately 80% of inputs shorter than 2K tokens. Long inputs constitute only a small fraction. Existing cluster-level LLM scheduling strategies, including First-In-First-Out (FIFO), reservation-based, and priority-based approaches, primarily target short-input requests with lengths below 2K and fail to address this heterogeneity, leading to inefficiencies such as head-of-line blocking, resource underutilization, and starvation of long-input requests. We propose PecSched, a Preemptive and Efficient Cluster SCHEDuling system for LLM inference. PecSched introduces the following key techniques: 1)…
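The head-of-line blocking mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not PecSched's algorithm: it assumes a single worker, unit time steps, and zero-cost preemption, and every name in it (`Request`, `simulate`, the workload) is hypothetical. It only contrasts FIFO with a simple "short requests first, preempting long ones" policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int    # time step at which the request arrives
    work: int       # time steps of service the request needs
    is_short: bool  # True for a short-input request


def simulate(requests, preemptive):
    """Return (queue_delay, finish_time) dicts keyed by request name.

    Assumption: one worker, one unit of work per time step, free preemption.
    """
    remaining = {r.name: r.work for r in requests}
    first_start, finish = {}, {}
    t = 0
    while len(finish) < len(requests):
        ready = [r for r in requests if r.arrival <= t and r.name not in finish]
        if not ready:
            t += 1
            continue
        if preemptive:
            # Short requests go first; ties are broken by arrival time.
            ready.sort(key=lambda r: (not r.is_short, r.arrival))
        else:
            # FIFO: the earliest arrival runs until it completes.
            ready.sort(key=lambda r: r.arrival)
        current = ready[0]
        first_start.setdefault(current.name, t)
        remaining[current.name] -= 1
        t += 1
        if remaining[current.name] == 0:
            finish[current.name] = t
    delays = {r.name: first_start[r.name] - r.arrival for r in requests}
    return delays, finish


# Toy workload: one long request arrives just before two short ones.
workload = [
    Request("L", arrival=0, work=10, is_short=False),
    Request("S1", arrival=1, work=1, is_short=True),
    Request("S2", arrival=2, work=1, is_short=True),
]

fifo_delays, fifo_finish = simulate(workload, preemptive=False)
pre_delays, pre_finish = simulate(workload, preemptive=True)

print("FIFO queue delays:      ", fifo_delays)
print("Preemptive queue delays:", pre_delays)
print("Long request finish: FIFO", fifo_finish["L"], "vs preemptive", pre_finish["L"])
```

In this toy setting the short requests stop waiting behind the long one, while the long request finishes slightly later. A real system also has to pay for preemption, which this sketch ignores.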

Taxonomy

Topics: Neural Networks and Applications · Water Quality Monitoring Technologies · Blind Source Separation Techniques