inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Huamin Chen; Xunzhuo Liu; Yuhan Liu; Junchen Jiang; Bowei He; Xue Liu

arXiv:2603.16054 · cs.DC · March 18, 2026


TL;DR

This paper introduces inference-fleet-sim, a tool that combines analytical M/G/c queueing with discrete-event simulation to choose GPU fleet size, GPU type, and routing for large language model inference. It targets the question that per-engine tuning tools leave open: how many GPUs to buy and how to arrange them.

Contribution

The paper integrates analytical M/G/c queueing models with discrete-event simulation (DES) to find the minimum-cost GPU fleet configuration that meets a P99 TTFT SLO.

Findings

01 Accurately predicts optimal GPU configurations across diverse workloads.

02 Identifies cost-effective GPU deployment strategies that simple analysis misses.

03 Demonstrates the necessity of joint simulation for correct fleet sizing and routing.

Abstract

Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven…
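To make the "analytical M/G/c plus DES" idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation or API. The function names (`erlang_c`, `allen_cunneen_mean_wait`, `simulate_p99_wait`, `min_servers`), the lognormal service-time distribution, and all numeric parameters are assumptions chosen for illustration. The sketch also simplifies heavily: each GPU is treated as one server handling one request at a time (no batching), and queueing delay stands in for TTFT.

```python
import heapq
import math
import random


def erlang_c(c, offered_load):
    """Probability that an arrival must wait in an M/M/c queue (Erlang C)."""
    rho = offered_load / c
    if rho >= 1.0:
        return 1.0
    # Erlang B via the standard recursion, then convert to Erlang C.
    b = 1.0
    for k in range(1, c + 1):
        b = offered_load * b / (k + offered_load * b)
    return b / (1.0 - rho * (1.0 - b))


def allen_cunneen_mean_wait(c, lam, mean_s, scv_s):
    """Allen-Cunneen approximation of mean queueing delay for M/G/c.

    scv_s is the squared coefficient of variation of the service time.
    """
    a = lam * mean_s
    rho = a / c
    if rho >= 1.0:
        return math.inf
    wq_mmc = erlang_c(c, a) * mean_s / (c * (1.0 - rho))
    return wq_mmc * (1.0 + scv_s) / 2.0


def simulate_p99_wait(c, lam, sample_service, n=20000, seed=0):
    """Discrete-event simulation of a FCFS M/G/c queue; returns empirical P99 wait."""
    rng = random.Random(seed)
    free_at = [0.0] * c          # time at which each server next becomes idle
    heapq.heapify(free_at)
    t = 0.0
    waits = []
    for _ in range(n):
        t += rng.expovariate(lam)                 # Poisson arrivals
        start = max(t, heapq.heappop(free_at))    # earliest available server
        waits.append(start - t)
        heapq.heappush(free_at, start + sample_service(rng))
    waits.sort()
    return waits[int(0.99 * (len(waits) - 1))]


def min_servers(lam, sample_service, mean_s, slo, c_max=256):
    """Smallest server count whose simulated P99 wait meets the SLO, or None."""
    c = math.floor(lam * mean_s) + 1   # smallest count with utilization < 1
    while c <= c_max:
        if simulate_p99_wait(c, lam, sample_service) <= slo:
            return c
        c += 1
    return None


if __name__ == "__main__":
    # Hypothetical workload: 20 req/s, lognormal (heavy-tailed) service times.
    mu, sigma = -1.0, 1.0
    mean_s = math.exp(mu + sigma ** 2 / 2)
    scv_s = math.exp(sigma ** 2) - 1.0

    def service(rng):
        return rng.lognormvariate(mu, sigma)

    c = min_servers(lam=20.0, sample_service=service, mean_s=mean_s, slo=0.5)
    print("servers needed for P99 wait <= 0.5 s:", c)
    print("analytical mean wait at that size:",
          allen_cunneen_mean_wait(c, 20.0, mean_s, scv_s))
```

The analytical approximation gives only a mean delay, while the SLO in the abstract is a P99. That is one reason a simulation step is useful: under heavy-tailed service times the tail can sit far from the mean, so the sketch accepts a fleet size only when the simulated P99 meets the target.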
