Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu

TL;DR
HeteroScale is a novel autoscaling framework for disaggregated LLM inference that improves resource utilization and operational efficiency by coordinating hardware-aware scheduling with a metric-driven policy, validated in large-scale production.
Contribution
It introduces HeteroScale, the first large-scale empirical study and framework for coordinated autoscaling of disaggregated LLM inference, addressing hardware heterogeneity and network bottlenecks.
Findings
Increased GPU utilization by 26.6 percentage points.
Saved hundreds of thousands of GPU-hours daily.
Maintained service level objectives effectively.
Abstract
Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
