Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Rongzhi Li; Ruogu Du; Zefang Chu; Sida Zhao; Chunlei Han; Zuocheng Shi; Yiwen Shao; Huanle Han; Long Huang; Zherui Liu; Shufan Liu

arXiv:2508.19559 · cs.DC · August 28, 2025

TL;DR

HeteroScale is a novel autoscaling framework for disaggregated LLM inference that improves resource utilization and operational efficiency by coordinating hardware-aware scheduling with a metric-driven policy, validated in large-scale production.

Contribution

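The paper introduces HeteroScale, a framework for coordinated autoscaling of disaggregated LLM inference that addresses hardware heterogeneity and network bottlenecks. Its scaling policy is derived from what the authors describe as the first large-scale empirical study of autoscaling signals in production.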

Findings

1. Increased GPU utilization by 26.6 percentage points.
2. Saved hundreds of thousands of GPU-hours daily.
3. Maintained service level objectives effectively.

Abstract

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring…
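The idea of driving both pools from one signal can be illustrated with a minimal sketch. The abstract does not name the metric or the scaling rule, so everything here is an assumption: the function `joint_scale`, the `PoolSizes` type, the fixed prefill-to-decode ratio `pd_ratio`, and the dead-band `tolerance` are hypothetical, and the proportional rule is the generic one used by common horizontal autoscalers, not HeteroScale's actual policy.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSizes:
    """Instance counts for the two stages (hypothetical type)."""
    prefill: int
    decode: int


def joint_scale(current: PoolSizes, observed: float, target: float,
                pd_ratio: float, tolerance: float = 0.1) -> PoolSizes:
    """Resize both pools from one shared metric.

    Assumption: a generic proportional rule, not the paper's policy.
    `observed` and `target` are values of a single, unspecified load
    metric; `pd_ratio` is the desired prefill/decode instance ratio.
    """
    if target <= 0 or pd_ratio <= 0:
        raise ValueError("target and pd_ratio must be positive")

    factor = observed / target
    # Dead band: ignore small deviations so the pools do not oscillate.
    if abs(factor - 1.0) <= tolerance:
        return current

    # Scale the decode pool proportionally, then derive the prefill pool
    # from it so the two stages keep the same ratio after every decision.
    decode = max(1, math.ceil(current.decode * factor))
    prefill = max(1, math.ceil(decode * pd_ratio))
    return PoolSizes(prefill=prefill, decode=decode)


if __name__ == "__main__":
    before = PoolSizes(prefill=4, decode=8)
    print(joint_scale(before, observed=150.0, target=100.0, pd_ratio=0.5))
```

Deriving the prefill count from the decode count, rather than scaling each pool on its own signal, is what keeps the two stages from drifting apart in this sketch; a production system would also need the hardware and network constraints that the abstract attributes to the topology-aware scheduler.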
