Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Huamin Chen; Xunzhuo Liu; Junchen Jiang; Bowei He; Xue Liu

arXiv:2604.09613 · cs.DC · April 16, 2026

TL;DR

This paper introduces token-budget-aware routing for vLLM fleets: each request is dispatched to a short-context or long-context pool according to its estimated token budget, which reduces the number of GPU instances by 17-39% on real-world traces.

Contribution

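It proposes a routing method that learns a per-category bytes-per-token ratio online, via an exponential moving average over usage.prompt_tokens feedback, and uses the resulting token-budget estimate to dispatch each request to one of two right-sized vLLM pools. No tokenizer is required.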

Findings

01

Reduces GPU instances by 17-39% on real-world traces.

02

Predicts fleet savings using a simple cost model based on traffic fraction and throughput gain.

03

Achieves substantial cost savings in large-scale LLM inference scenarios.
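
As a rough illustration of the cost model in finding 02, the sketch below evaluates the closed form savings = alpha * (1 - 1/rho) quoted in the abstract. It assumes alpha is the short-traffic fraction and rho is the throughput gain named in finding 02; the function name and the input values are hypothetical and are not taken from the paper's traces.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Predicted fraction of GPU capacity saved: alpha * (1 - 1/rho).

    alpha: fraction of traffic that is short (0..1).
    rho:   throughput gain, assumed here to be >= 1 (1 means no gain).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rho < 1.0:
        raise ValueError("rho is assumed to be at least 1")
    return alpha * (1.0 - 1.0 / rho)


# Illustrative inputs only: 80% short traffic, 4x throughput gain.
example = fleet_savings(alpha=0.8, rho=4.0)
print(f"predicted savings: {example:.0%}")
```

With rho = 1 the formula predicts no savings regardless of alpha, which matches the intuition that routing only helps when the short pool is actually faster.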

Abstract

Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating per-category bytes-per-token ratio, then dispatch it to one of two vLLM pools -- a high-throughput short pool or a high-capacity long pool -- each right-sized for its workload class. The ratio is learned online via exponential moving average from usage.prompt_tokens feedback, requiring no tokenizer. A closed-form cost model, savings = alpha * (1 - 1/rho), predicts fleet-level GPU savings from two observable quantities: the short-traffic fraction alpha and…
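
The routing loop described in the abstract might be sketched as follows. This is a minimal illustration and not the authors' implementation: the class name, the pool threshold, the smoothing factor, the starting ratio, and the choice to define the total budget as estimated prompt tokens plus the requested output tokens are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    short_pool_limit: int = 8192   # hypothetical token capacity of the short pool
    ema_decay: float = 0.1         # hypothetical EMA smoothing factor
    default_ratio: float = 4.0     # hypothetical initial bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # learned ratio per category

    def estimate_prompt_tokens(self, category: str, prompt: str) -> float:
        # Estimate tokens from byte length alone, so no tokenizer is needed.
        ratio = self.ratios.get(category, self.default_ratio)
        return len(prompt.encode("utf-8")) / ratio

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        # Assumed budget definition: estimated prompt tokens + output allowance.
        budget = self.estimate_prompt_tokens(category, prompt) + max_output_tokens
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        # Self-calibrate from the usage.prompt_tokens value in the response.
        if prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        old = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.ema_decay) * old + self.ema_decay * observed


router = TokenBudgetRouter(short_pool_limit=100, ema_decay=0.5)
prompt = "a" * 200
print(router.route("code", prompt, max_output_tokens=40))  # uses the default ratio
router.observe("code", prompt, prompt_tokens=100)          # feedback: 2 bytes/token
print(router.route("code", prompt, max_output_tokens=40))  # uses the updated ratio
```

Keeping one ratio per category lets content with different byte densities, such as code versus prose, calibrate independently, at the cost of a cold-start period during which the default ratio is used.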
