Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Ferdi Kossmann; Bruce Fontaine; Daya Khudia; Michael Cafarella; Samuel Madden

arXiv:2410.17840·cs.LG·January 29, 2025

TL;DR

This paper reviews existing scheduling methods for Large Language Model serving systems, highlighting the trade-offs between complexity and performance, and introduces two new simple yet effective scheduling techniques that outperform current methods on real workloads.

Contribution

The paper surveys existing scheduling techniques for LLM serving systems and proposes two novel, easy-to-implement scheduling methods that improve performance on production workloads.

Findings

1. Schedulers from the literature often perform well but introduce significant complexity.

2. Schedulers in practical serving systems are simple but leave performance gains untapped.

3. The proposed techniques outperform existing methods on real workload traces.

Abstract

Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each server, an engine-level scheduler decides when to run a request, or when to queue or preempt it. Improved scheduling policies may benefit a wide range of LLM deployments and can often be implemented as "drop-in replacements" to a system's current policy. In this work, we survey scheduling techniques from the literature and from practical serving systems. We find that schedulers from the literature often achieve good performance but introduce significant complexity. In contrast,…
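To make the two scheduling levels concrete, here is a minimal Python sketch. It is not one of the techniques proposed in the paper. It assumes a least-loaded routing policy and a first-come-first-served engine queue with a fixed token budget, and the names `Request`, `EngineScheduler`, `LoadBalancer`, `tokens_needed`, and `capacity` are hypothetical, chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    tokens_needed: int  # hypothetical memory cost of running this request


@dataclass
class EngineScheduler:
    """Engine-level scheduler for one replica: run, queue, or preempt."""
    capacity: int  # hypothetical token budget of the replica (assumption)
    running: list = field(default_factory=list)
    waiting: deque = field(default_factory=deque)

    def used(self) -> int:
        # Tokens held by requests that are currently running.
        return sum(r.tokens_needed for r in self.running)

    def load(self) -> int:
        # Running plus queued work; the load balancer compares this value.
        return self.used() + sum(r.tokens_needed for r in self.waiting)

    def submit(self, request: Request) -> None:
        self.waiting.append(request)
        self.schedule()

    def schedule(self) -> None:
        # Admit queued requests in arrival order while they fit the budget.
        while self.waiting and (
            self.used() + self.waiting[0].tokens_needed <= self.capacity
        ):
            self.running.append(self.waiting.popleft())

    def preempt_latest(self) -> None:
        # Move the most recently admitted request to the front of the queue.
        if self.running:
            self.waiting.appendleft(self.running.pop())


class LoadBalancer:
    """Routes each request to the replica with the least outstanding work."""

    def __init__(self, engines):
        self.engines = engines

    def route(self, request: Request) -> int:
        # Ties go to the lowest index because min() keeps the first minimum.
        idx = min(range(len(self.engines)), key=lambda i: self.engines[i].load())
        self.engines[idx].submit(request)
        return idx
```

In this sketch, swapping the policy means replacing `route` or `schedule` while the rest of the system stays unchanged, which is the sense in which the abstract calls improved policies "drop-in replacements".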

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies