PolyServe: Efficient Multi-SLO Serving at Scale
Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, Liguang Xie

TL;DR
PolyServe is a multi-SLO scheduling system for large language model serving. It improves throughput and tail-latency management by grouping requests according to their latency requirements and scheduling them dynamically.
Contribution
The paper presents a multi-SLO scheduling policy that handles diverse latency requirements and improves server utilization at scale.
Findings
Achieves a 1.23x goodput gain over existing policies.
Attains up to 92.5% of optimal goodput.
Effectively manages tail latency through request-aware scheduling.
Abstract
Advances in Large Language Models (LLMs) have led to a surge of LLM-powered applications. These applications have diverse token-generation latency requirements. As a result, simply classifying workloads as latency-sensitive (LS) or best-effort (BE) overlooks the nuances within the latency-sensitive category and results in suboptimal user experiences and scheduling opportunities. However, efficiently serving requests with multiple SLO requirements poses significant challenges. First, all requests within a batch generate new tokens simultaneously, which can misalign them with their distinct SLO requirements. Moreover, while existing systems focus on auto-scaling for handling various overall request rates, the diversity of SLOs necessitates fine-grained auto-scaling among these SLO tiers. Finally, unlike LS/BE scenarios, where BE requests can be aborted at any time to ensure the SLO…
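The batching problem described in the abstract can be illustrated with a small sketch. Because every request in a batch produces its next token in the same step, a mixed batch has to run at the pace of its tightest member. Grouping requests by latency requirement avoids that. The code below is only an illustration under stated assumptions, not PolyServe's actual policy: the `Request` type, the per-token latency field `tpot_slo_ms`, the tier bounds, and both helper functions are hypothetical names introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tpot_slo_ms: float  # hypothetical per-token latency target, in milliseconds


def mixed_batch_bound(requests):
    """Per-token latency a single mixed batch must meet: the tightest SLO in it."""
    return min(r.tpot_slo_ms for r in requests)


def bin_by_slo(requests, tier_bounds_ms):
    """Place each request in the loosest tier whose bound still satisfies its SLO.

    A tier with bound b serves tokens no slower than b ms, so it satisfies any
    request whose SLO is >= b. Requests tighter than every tier go under None.
    The tier bounds themselves are an assumption of this sketch.
    """
    bounds = sorted(tier_bounds_ms)
    bins = defaultdict(list)
    for r in requests:
        eligible = [b for b in bounds if b <= r.tpot_slo_ms]
        tier = eligible[-1] if eligible else None
        bins[tier].append(r.req_id)
    return dict(bins)


if __name__ == "__main__":
    reqs = [
        Request("a", 20.0),
        Request("b", 60.0),
        Request("c", 100.0),
    ]
    # One shared batch is forced to the 20 ms pace of request "a".
    print(mixed_batch_bound(reqs))
    # Separate tiers let "b" and "c" run at looser, cheaper paces.
    print(bin_by_slo(reqs, [20.0, 50.0, 100.0]))
```

In this sketch, a request with a 60 ms target lands in the 50 ms tier, since that is the loosest tier that still meets its requirement. How tiers are sized and scaled in the real system is not covered by the summary above.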
