inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

TL;DR
This paper introduces inference-fleet-sim, a tool that combines queueing theory with simulation to choose GPU fleet size, GPU type, and routing for large language model (LLM) inference, a planning question that existing per-engine tuning tools do not address.
Contribution
It integrates analytical M/G/c queueing models with discrete-event simulation (DES) to find the minimum-cost GPU fleet configuration that meets a P99 TTFT SLO.
Findings
Accurately predicts optimal GPU configurations across diverse workloads.
Identifies cost-effective GPU deployment strategies that simple analysis misses.
Demonstrates the necessity of joint simulation for correct fleet sizing and routing.
Abstract
Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven…
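To make the analytical half of this concrete, here is a minimal sketch of an M/G/c sizing step. It is not the tool's implementation: it assumes Poisson arrivals, uses the Allen-Cunneen approximation for M/G/c mean queueing delay, and targets a mean wait rather than the P99 TTFT the paper uses. All function names and parameters (`erlang_c`, `mgc_mean_wait`, `min_servers`, `scv`) are hypothetical and chosen for illustration.

```python
import math


def erlang_c(c: int, a: float) -> float:
    """Probability that an arrival must wait in an M/M/c queue.

    c is the number of servers; a = arrival_rate * mean_service_time
    is the offered load in Erlangs.
    """
    if a >= c:
        return 1.0  # unstable: every arrival waits
    # Erlang B via the standard numerically stable recursion.
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    rho = a / c
    # Convert Erlang B to Erlang C.
    return b / (1.0 - rho * (1.0 - b))


def mgc_mean_wait(lam: float, mean_s: float, scv: float, c: int) -> float:
    """Approximate mean queueing delay in an M/G/c queue (Allen-Cunneen).

    lam: arrival rate (requests/s); mean_s: mean service time (s);
    scv: squared coefficient of variation of service time
    (large for heavy-tailed token-length distributions).
    """
    a = lam * mean_s
    if a >= c:
        return math.inf
    wq_mmc = erlang_c(c, a) * mean_s / (c - a)  # exact M/M/c mean wait
    return (1.0 + scv) / 2.0 * wq_mmc  # variability correction


def min_servers(lam: float, mean_s: float, scv: float,
                target_wait: float, c_max: int = 100_000) -> int:
    """Smallest server count whose approximate mean wait meets target_wait."""
    c = max(1, math.ceil(lam * mean_s))
    while c <= c_max:
        if mgc_mean_wait(lam, mean_s, scv, c) <= target_wait:
            return c
        c += 1
    raise ValueError("no feasible server count up to c_max")


if __name__ == "__main__":
    # Hypothetical workload: 20 req/s, 2 s mean service time, scv = 4.
    print(min_servers(lam=20.0, mean_s=2.0, scv=4.0, target_wait=0.5))
```

An approximation like this depends only on the first two moments of service time, which is one reason an analytical estimate alone can be off under heavy-tailed workloads and a simulation pass is used to check the candidate fleet against the tail-latency SLO.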
