Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
Jiaming Cheng, Duong Tung Nguyen

TL;DR
This paper introduces scalable heuristics for large-scale heterogeneous LLM deployment. They optimize resource allocation under strict latency, accuracy, and budget constraints, and they outperform exact methods in both speed and robustness.
Contribution
The paper presents two constraint-aware heuristics for LLM resource allocation, a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH), that maintain feasibility and near-optimality at scale.
Findings
Both heuristics produce feasible solutions in under one second.
AGH comes close to the optimal cost and is over 260x faster than the exact approach on large instances.
AGH maintains stable costs and SLO adherence under stress tests.
Abstract
Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one…
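To make the single-pass greedy idea concrete, here is a minimal Python sketch of a constraint-aware greedy allocation. It is not the authors' GH. Every name in it (`Config`, `Workload`, `is_feasible`, `greedy_allocate`) is hypothetical, the numbers are invented for illustration, and "effective coverage" is assumed to mean the portion of a replica's throughput that still serves uncovered demand. The memory check assumes model weights are split evenly across the tensor-parallel (TP) group, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical (model, GPU type, TP degree) replica option."""
    name: str
    tp: int                # tensor-parallel degree (GPUs per replica)
    gpu_mem_gb: float      # memory of one GPU
    model_mem_gb: float    # total model footprint (assumed split evenly over TP)
    cost_per_hour: float   # cost of one whole replica
    throughput: float      # requests/s served by one replica
    latency_ms: float
    error_rate: float


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload class with its SLO limits."""
    name: str
    demand: float          # requests/s to cover
    max_latency_ms: float
    max_error: float


def is_feasible(c: Config, w: Workload) -> bool:
    # TP-aware memory check plus latency and error limits.
    fits = c.model_mem_gb / c.tp <= c.gpu_mem_gb
    return fits and c.latency_ms <= w.max_latency_ms and c.error_rate <= w.max_error


def greedy_allocate(workloads, configs, budget):
    """Return (plan, spent), or None if demand cannot be covered within budget."""
    plan, spent = {}, 0.0
    for w in workloads:
        remaining = w.demand
        chosen = []
        candidates = [c for c in configs if is_feasible(c, w)]
        while remaining > 1e-9:
            affordable = [c for c in candidates if spent + c.cost_per_hour <= budget]
            if not affordable:
                return None
            # Rank by cost per unit of demand the replica would actually cover.
            best = min(affordable,
                       key=lambda c: c.cost_per_hour / min(c.throughput, remaining))
            chosen.append(best.name)
            spent += best.cost_per_hour
            remaining -= best.throughput
        plan[w.name] = chosen
    return plan, spent


# Illustrative data only; none of these figures come from the paper.
configs = [
    Config("small-tp1", 1, 24.0, 16.0, 1.0, 10.0, 200.0, 0.10),
    Config("large-tp1", 1, 24.0, 40.0, 1.5, 8.0, 300.0, 0.03),   # does not fit in memory
    Config("large-tp2", 2, 24.0, 40.0, 3.0, 12.0, 250.0, 0.03),
]
workloads = [
    Workload("chat", 15.0, 250.0, 0.15),
    Workload("code", 10.0, 300.0, 0.05),
]

result = greedy_allocate(workloads, configs, budget=6.0)
tight = greedy_allocate(workloads, configs, budget=4.0)
```

With this toy data, `chat` is served by two `small-tp1` replicas and `code` by one `large-tp2` replica, because the large model only fits once it is sharded across two GPUs. Under the tighter budget the single pass returns `None`. A single greedy pass like this can get stuck on such cases, which is the kind of gap that the multi-start construction, relocate-based local search, and GPU consolidation described for AGH are meant to address.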
