Queue management for SLO-oriented large language model serving
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

TL;DR
QLM is a queue management system for large language model serving that optimizes request scheduling to improve SLO adherence and resource utilization, especially for mixed request types with different latency requirements.
Contribution
The paper introduces QLM, a novel queue management system that uses a Request Waiting Time estimator and global scheduling to optimize LLM request handling across diverse SLOs.
Findings
Improves SLO attainment by 40-90%.
Enhances throughput by 20-400%.
Maintains or improves device utilization.
Abstract
Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests that have relaxed SLOs along with interactive requests, it leads to poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintain SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. These estimates are used by a global…
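To make the queue-ordering idea in the abstract concrete, here is a minimal sketch. It is not QLM's RWT Estimator or scheduler: it assumes each request comes with a known service-time estimate, approximates a request's waiting time as the sum of the service times ahead of it in a single queue, and swaps in plain earliest-deadline-first ordering as a stand-in for the paper's global scheduling. All names (Request, estimate_waiting_times, slo_violations, order_by_deadline) and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    slo_seconds: float          # latency SLO, measured from now (assumed)
    est_service_seconds: float  # assumed per-request service-time estimate


def estimate_waiting_times(queue):
    """Waiting time of each request = total service time of requests ahead of it."""
    waits = []
    elapsed = 0.0
    for request in queue:
        waits.append(elapsed)
        elapsed += request.est_service_seconds
    return waits


def slo_violations(queue):
    """Names of requests whose estimated completion time exceeds their SLO."""
    waits = estimate_waiting_times(queue)
    return [
        request.name
        for request, wait in zip(queue, waits)
        if wait + request.est_service_seconds > request.slo_seconds
    ]


def order_by_deadline(queue):
    """Earliest-deadline-first reordering (a simple stand-in, not QLM's policy)."""
    return sorted(queue, key=lambda request: request.slo_seconds)


# Illustrative mix: one relaxed-SLO batch job ahead of two interactive requests.
fifo_queue = [
    Request("batch", slo_seconds=600.0, est_service_seconds=30.0),
    Request("chat", slo_seconds=5.0, est_service_seconds=2.0),
    Request("code", slo_seconds=10.0, est_service_seconds=3.0),
]

print(slo_violations(fifo_queue))                     # arrival order
print(slo_violations(order_by_deadline(fifo_queue)))  # deadline order
```

In arrival order, the batch job delays both interactive requests past their SLOs; after reordering by deadline, all three requests are estimated to finish in time. This illustrates why queue ordering driven by waiting-time estimates matters when batch and interactive requests share a queue.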
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Focus
