Hierarchical Autoscaling for Large Language Model Serving with Chiron

Archit Patke; Dhemath Reddy; Saurabh Jha; Chandra Narayanaswami; Zbigniew Kalbarczyk; Ravishankar Iyer

arXiv:2501.08090·cs.DC·January 15, 2025

Open Access

TL;DR

This paper introduces Chiron, a hierarchical autoscaler for large language model serving that improves resource utilization and SLO adherence by considering request-specific SLOs and employing backpressure estimation.

Contribution

Chiron is a novel autoscaling approach that incorporates request SLOs and hierarchical backpressure to optimize LLM serving efficiency.

Findings

01. Chiron achieves up to 90% higher SLO attainment.

02. Chiron improves GPU efficiency by up to 70%.

03. Chiron outperforms existing autoscaling solutions for LLM serving.

Abstract

Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs on the order of seconds, and (b) batch requests that have relaxed SLOs on the order of minutes to hours. SLO attainment can degrade depending on arrival rates, multiplexing, and configuration parameters, which necessitates resource autoscaling of serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO…
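To make the idea of an SLO-aware backpressure signal concrete, here is a minimal Python sketch. It is not Chiron's algorithm, which the abstract does not specify. It covers only a single level of scaling and does not model the hierarchy or batch-size tuning. The names `QueueState`, `backpressure`, and `scale_decision`, the drain-time formula, and all thresholds are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one request class (not from the paper)."""
    queued_requests: int   # requests currently waiting
    service_rate: float    # requests/second completed by one instance (assumed measured)
    slo_seconds: float     # latency target for this request class


def backpressure(state: QueueState, instances: int) -> float:
    """Estimated queue drain time divided by the SLO; above 1.0 the SLO is at risk."""
    if instances <= 0 or state.service_rate <= 0:
        return float("inf")
    drain_time = state.queued_requests / (state.service_rate * instances)
    return drain_time / state.slo_seconds


def scale_decision(state: QueueState, instances: int, utilization: float,
                   high: float = 1.0, low: float = 0.5,
                   min_util: float = 0.3) -> int:
    """Return the new instance count. Thresholds are illustrative assumptions."""
    bp = backpressure(state, instances)
    if bp > high:
        # Queue cannot drain within the SLO: add an instance.
        return instances + 1
    if bp < low and utilization < min_util and instances > 1:
        # Plenty of slack and idle GPUs: remove an instance.
        return instances - 1
    return instances


# Same backlog, different SLOs, different decisions.
interactive = QueueState(queued_requests=100, service_rate=5.0, slo_seconds=10.0)
batch = QueueState(queued_requests=100, service_rate=5.0, slo_seconds=3600.0)
print(scale_decision(interactive, instances=1, utilization=0.9))
print(scale_decision(batch, instances=2, utilization=0.2))
```

Under these assumptions, an identical backlog triggers a scale-up for the interactive class and a scale-down for the batch class. This is the kind of distinction that an autoscaler ignoring request SLOs would miss.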

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Computational Physics and Python Applications