From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu

TL;DR
This paper introduces an operator-level autoscaling framework for large generative models, improving efficiency and performance by exploiting the heterogeneity of model operators rather than scaling the entire model as a monolith.
Contribution
The paper proposes a novel operator-level autoscaling approach that enhances resource utilization and performance for large generative models by considering operator heterogeneity.
Findings
Up to 40% fewer GPUs used while maintaining SLOs
Achieves 1.6x higher throughput with 5% less energy under fixed resources
Operator-level scaling outperforms model-level approaches in efficiency
Abstract
Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size,…
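The contrast between model-level and operator-level scaling described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's algorithm: the operator names, per-replica throughputs, GPU costs, and the assumption that a whole-model replica is limited by its slowest operator are all hypothetical, chosen only to show why heterogeneous operators make monolithic scaling wasteful.

```python
import math

# Hypothetical operators: per-replica throughput (requests/s) and GPU cost.
# These names and numbers are illustrative assumptions, not figures from the paper.
OPERATORS = {
    "embedding": {"capacity_rps": 400.0, "gpu_cost": 0.25},
    "attention": {"capacity_rps": 50.0, "gpu_cost": 1.0},
    "mlp": {"capacity_rps": 100.0, "gpu_cost": 1.0},
}


def replicas_needed(load_rps: float, capacity_rps: float) -> int:
    """Smallest replica count (at least 1) whose total capacity covers the load."""
    return max(1, math.ceil(load_rps / capacity_rps))


def model_level_plan(load_rps: float) -> dict:
    """Scale the model as a monolith: every operator gets the replica
    count required by the slowest (bottleneck) operator."""
    n = max(replicas_needed(load_rps, op["capacity_rps"]) for op in OPERATORS.values())
    return {name: n for name in OPERATORS}


def operator_level_plan(load_rps: float) -> dict:
    """Scale each operator independently according to its own capacity."""
    return {
        name: replicas_needed(load_rps, op["capacity_rps"])
        for name, op in OPERATORS.items()
    }


def gpu_cost(plan: dict) -> float:
    """Total GPU cost of a plan under the hypothetical per-replica costs."""
    return sum(count * OPERATORS[name]["gpu_cost"] for name, count in plan.items())


if __name__ == "__main__":
    load = 200.0  # hypothetical offered load in requests/s
    for label, plan in [
        ("model-level", model_level_plan(load)),
        ("operator-level", operator_level_plan(load)),
    ]:
        print(label, plan, "GPU cost:", gpu_cost(plan))
```

Under these made-up numbers, the monolithic plan replicates every operator four times to satisfy the attention bottleneck, while the per-operator plan adds replicas only where capacity is short. A real system would also have to account for placement, batching, and communication between operators, which this sketch ignores.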
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies
