Queue management for SLO-oriented large language model serving

Archit Patke; Dhemath Reddy; Saurabh Jha; Haoran Qiu; Christian Pinto; Chandra Narayanaswami; Zbigniew Kalbarczyk; Ravishankar Iyer

arXiv:2407.00047 · cs.DC · February 26, 2025

TL;DR

QLM is a queue management system for large language model serving that optimizes request scheduling to improve SLO adherence and resource utilization, especially for mixed request types with different latency requirements.

Contribution

The paper introduces QLM, a novel queue management system that uses a Request Waiting Time estimator and global scheduling to optimize LLM request handling across diverse SLOs.

Findings

1. Improves SLO attainment by 40-90%.
2. Enhances throughput by 20-400%.
3. Maintains or improves device utilization.

Abstract

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, which have tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. The ordering of this queue is critical to meeting SLOs while keeping resource utilization high. To generate an optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. These estimates are used by a global…
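To make the idea of estimating waiting times for queued requests concrete, here is a minimal sketch. It is not QLM's implementation: the abstract above does not specify how the RWT Estimator works, so this assumes a simple model in which a request waits for the predicted output tokens of everything ahead of it, divided by a fixed token throughput. The names `Request`, `est_output_tokens`, `slo_seconds`, and `tokens_per_second` are hypothetical, and the earliest-deadline-first reordering is a stand-in for whatever ordering policy the paper's scheduler uses.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    est_output_tokens: int  # hypothetical: predicted tokens this request will generate
    slo_seconds: float      # hypothetical: latest acceptable waiting time in the queue


def estimate_waiting_times(queue, tokens_per_second):
    """Estimate, per request, the seconds spent waiting behind earlier requests.

    Assumed model: waiting time = (predicted tokens of requests ahead) / throughput.
    """
    waits = {}
    tokens_ahead = 0
    for req in queue:
        waits[req.name] = tokens_ahead / tokens_per_second
        tokens_ahead += req.est_output_tokens
    return waits


def slo_violations(queue, tokens_per_second):
    """Return the names of requests whose estimated wait exceeds their SLO."""
    waits = estimate_waiting_times(queue, tokens_per_second)
    return [r.name for r in queue if waits[r.name] > r.slo_seconds]


def reorder_by_deadline(queue):
    """Earliest-deadline-first ordering (an assumed stand-in policy, not QLM's)."""
    return sorted(queue, key=lambda r: r.slo_seconds)


if __name__ == "__main__":
    # Mixed queue: two batch requests with relaxed SLOs, two interactive ones.
    queue = [
        Request("batch-1", 4000, 3600.0),
        Request("chat-1", 200, 2.0),
        Request("batch-2", 8000, 3600.0),
        Request("chat-2", 300, 5.0),
    ]
    throughput = 1000.0  # assumed tokens generated per second
    print("FIFO violations:", slo_violations(queue, throughput))
    print("Reordered violations:", slo_violations(reorder_by_deadline(queue), throughput))
```

In this toy queue, arrival order leaves both interactive requests waiting behind a long batch request, while ordering by SLO lets them run first at little cost to the batch requests. A real estimator would also have to account for batching, multiple models, and prediction error in the output length.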

Code & Models

Repositories

qlm-project/qlm (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus