PolyServe: Efficient Multi-SLO Serving at Scale

Kan Zhu; Haiyang Shi; Le Xu; Jiaxin Shan; Arvind Krishnamurthy; Baris Kasikci; Liguang Xie

arXiv:2507.17769 · cs.DC · July 25, 2025

TL;DR

PolyServe is a multi-SLO scheduling system for large language model serving. It improves throughput and tail-latency management by grouping requests according to their latency requirements and scheduling them dynamically.
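To make the grouping idea concrete, here is a minimal sketch of binning requests by a per-token latency target. It is an illustration under stated assumptions, not PolyServe's actual policy: the `Request` class, the `tpot_slo_ms` field, the tier boundaries in `TIER_BOUNDS_MS`, and the rule "place each request in the loosest tier that still satisfies its SLO" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tpot_slo_ms: float  # hypothetical per-token latency target, in milliseconds


# Hypothetical tier bounds (ms per token); not taken from the paper.
TIER_BOUNDS_MS = [20.0, 50.0, 100.0]


def assign_tier(slo_ms, bounds=TIER_BOUNDS_MS):
    """Pick the loosest tier whose latency bound still satisfies the SLO."""
    eligible = [b for b in bounds if b <= slo_ms]
    if not eligible:
        raise ValueError(f"no tier is tight enough for an SLO of {slo_ms} ms")
    return max(eligible)


def group_by_tier(requests):
    """Map each tier bound to the IDs of the requests assigned to it."""
    groups = defaultdict(list)
    for req in requests:
        groups[assign_tier(req.tpot_slo_ms)].append(req.req_id)
    return dict(groups)


requests = [
    Request("a", 25.0),
    Request("b", 50.0),
    Request("c", 120.0),
    Request("d", 20.0),
]
groups = group_by_tier(requests)
```

In this sketch a request with a 25 ms target lands in the 20 ms tier, because the 50 ms tier would violate its requirement.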

Contribution

The paper presents a multi-SLO scheduling policy that handles diverse latency requirements and improves server utilization at scale.

Findings

1. Achieves 1.23x goodput gain over existing policies.

2. Attains up to 92.5% of optimal goodput.

3. Effectively manages tail latency through request-aware scheduling.
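The goodput figures can be read with a small worked example. The sketch below assumes a common definition of goodput, namely the number of requests that meet their latency SLO per unit time. This page does not state the paper's exact definition, so both the definition and the `goodput` helper are assumptions, and the sample numbers are invented for illustration.

```python
def goodput(latencies_ms, slos_ms, window_s):
    """Requests per second whose observed latency met their SLO (assumed definition)."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    met = sum(1 for lat, slo in zip(latencies_ms, slos_ms) if lat <= slo)
    return met / window_s


# Illustrative numbers only: four requests observed over a 2-second window.
observed_ms = [18.0, 55.0, 90.0, 30.0]
targets_ms = [20.0, 50.0, 100.0, 30.0]
achieved = goodput(observed_ms, targets_ms, window_s=2.0)
```

Here three of the four requests meet their targets, so the sketch yields 1.5 SLO-compliant requests per second. A "fraction of optimal goodput" is then this value divided by the goodput of an ideal schedule on the same workload.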

Abstract

Advances in Large Language Models (LLMs) have led to a surge of LLM-powered applications. These applications have diverse token-generation latency requirements. As a result, simply classifying workloads as latency-sensitive (LS) or best-effort (BE) overlooks the nuances within the latency-sensitive category and results in suboptimal user experiences and scheduling opportunities. However, efficiently serving requests with multiple SLO requirements poses significant challenges. First, all requests within a batch generate new tokens simultaneously, which can misalign them with their distinct SLO requirements. Moreover, while existing systems focus on auto-scaling for handling various overall request rates, the diversity of SLOs necessitates fine-grained auto-scaling among these SLO tiers. Finally, unlike LS/BE scenarios, where BE requests can be aborted at any time to ensure the SLO…

