Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

Jiaming Cheng; Duong Tung Nguyen

arXiv:2604.07472·cs.LG·April 10, 2026

TL;DR

This paper introduces two scalable heuristics for large-scale heterogeneous LLM deployment that allocate resources under strict latency, accuracy, and budget constraints. Both return feasible solutions in under one second, and the stronger of the two stays close to optimal cost while running over 260x faster than exact methods on large instances.

Contribution

The paper presents two constraint-aware heuristics, a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH), for efficient LLM resource allocation that maintain feasibility and near-optimality at scale.

Findings

01. Both heuristics produce feasible solutions in under one second.

02. AGH comes close to the optimal cost and is over 260x faster on large instances.

03. AGH maintains stable costs and SLO adherence under stress tests.

Abstract

Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one…
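
The abstract names TP-aware feasibility selection and cost-per-effective-coverage ranking without defining them here, so the following is only a minimal sketch of how a single-pass greedy allocation of this kind might look, not the authors' GH. Everything in it is an assumption made for illustration: the `Config` fields, the rule that a model fits when its memory divided by the tensor-parallel (TP) degree is within one GPU's memory, and the reading of "effective coverage" as the portion of remaining demand a replica can actually serve.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical (model, GPU type, TP degree) option; all fields are assumptions."""
    name: str
    tp: int               # tensor-parallel degree (GPUs per replica)
    gpu_mem_gb: float     # memory available on one GPU
    model_mem_gb: float   # total memory the model needs
    cost: float           # cost of one replica (all of its GPUs)
    throughput: float     # requests/s one replica can serve
    latency_ms: float     # expected per-request latency
    error_rate: float     # expected error rate of the model


def is_feasible(c: Config, max_latency_ms: float, max_error: float) -> bool:
    # Assumed TP-aware memory check: the model is sharded evenly across tp GPUs.
    fits = c.model_mem_gb / c.tp <= c.gpu_mem_gb
    return fits and c.latency_ms <= max_latency_ms and c.error_rate <= max_error


def greedy_allocate(configs: list[Config], demand: float,
                    max_latency_ms: float, max_error: float, budget: float):
    """Add replicas one at a time, always picking the affordable feasible
    config with the lowest cost per unit of demand it would actually cover."""
    feasible = [c for c in configs if is_feasible(c, max_latency_ms, max_error)]
    plan: dict[str, int] = {}
    remaining, spent = demand, 0.0
    while remaining > 0:
        affordable = [c for c in feasible if spent + c.cost <= budget]
        if not affordable:
            break  # budget exhausted before demand was covered
        # Effective coverage is capped by the demand still unserved.
        best = min(affordable, key=lambda c: c.cost / min(c.throughput, remaining))
        plan[best.name] = plan.get(best.name, 0) + 1
        remaining -= min(best.throughput, remaining)
        spent += best.cost
    return plan, spent, remaining


# Made-up example data, not taken from the paper.
configs = [
    Config("small-tp1", 1, 24.0, 16.0, 1.0, 10.0, 80.0, 0.08),
    Config("large-tp1", 1, 24.0, 40.0, 2.0, 30.0, 100.0, 0.03),  # does not fit in memory
    Config("large-tp2", 2, 24.0, 40.0, 3.0, 40.0, 120.0, 0.03),
]
plan, spent, remaining = greedy_allocate(
    configs, demand=50.0, max_latency_ms=150.0, max_error=0.1, budget=10.0
)
print(plan, spent, remaining)
```

In this toy instance the large model is only usable at TP=2, which is the kind of coupling between memory and parallelism the abstract describes. Capping coverage by the remaining demand makes the greedy step switch to a cheaper, smaller replica once little demand is left, rather than over-provisioning. Multi-start construction, relocate-based local search, and GPU consolidation, which the abstract attributes to AGH, are not modeled here.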
