Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

TL;DR
This paper introduces token-budget-aware pool routing for vLLM fleets: each request is dispatched to a short pool or a long pool according to its estimated total token budget, which reduces the number of GPU instances needed by 17-39% on real-world traces.
Contribution
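It proposes a routing algorithm that estimates each request's token budget from a per-category bytes-per-token ratio, learned online via exponential moving average from usage.prompt_tokens feedback, and dispatches the request to one of two right-sized vLLM pools. No tokenizer is required.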
Findings
Reduces GPU instances by 17-39% on real-world traces.
Predicts fleet savings using a simple cost model based on traffic fraction and throughput gain.
Achieves substantial cost savings in large-scale LLM inference scenarios.
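The cost model in the second finding has the closed form savings = alpha * (1 - 1/rho) stated in the abstract. A minimal Python sketch follows. It assumes rho is the throughput gain of the short pool over a worst-case-provisioned instance; the abstract's own definition is truncated, so that reading is an assumption taken from the finding above. The function name fleet_savings is hypothetical.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Predicted fraction of GPU instances saved: alpha * (1 - 1/rho).

    alpha: fraction of traffic routed to the short pool (0..1).
    rho:   ASSUMED to be the short pool's throughput gain relative to a
           worst-case-provisioned instance (rho >= 1).

    One way to read the formula: if the baseline fleet needs N instances,
    the long share still needs (1 - alpha) * N, while the short share
    needs alpha * N / rho, so the saved fraction is alpha * (1 - 1/rho).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rho < 1.0:
        raise ValueError("rho must be at least 1")
    return alpha * (1.0 - 1.0 / rho)
```

With illustrative inputs that are not taken from the paper, fleet_savings(0.8, 2.0) evaluates to 0.4, a predicted 40% reduction in instances. When rho is 1 (no throughput gain) or alpha is 0 (no short traffic), the predicted saving is zero.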
Abstract
Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating per-category bytes-per-token ratio, then dispatch it to one of two vLLM pools -- a high-throughput short pool or a high-capacity long pool -- each right-sized for its workload class. The ratio is learned online via exponential moving average from usage.prompt_tokens feedback, requiring no tokenizer. A closed-form cost model, savings = alpha * (1 - 1/rho), predicts fleet-level GPU savings from two observable quantities: the short-traffic fraction alpha and…
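The routing loop described in the abstract can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: the class name BudgetRouter, the pool labels, the 8192-token threshold, the EMA weight, the initial ratio of 4 bytes per token, and the choice to define the total budget as estimated prompt tokens plus the requested maximum output tokens are all hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class BudgetRouter:
    # All defaults are illustrative assumptions, not values from the paper.
    short_pool_max_tokens: int = 8192   # largest budget the short pool accepts
    ema_weight: float = 0.1             # weight given to each new observation
    default_ratio: float = 4.0          # initial bytes-per-token guess
    ratios: dict[str, float] = field(default_factory=dict)

    def estimate_budget(self, category: str, prompt: str,
                        max_output_tokens: int) -> int:
        """Estimate total tokens without running a tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        # ASSUMPTION: budget = estimated prompt tokens + output allowance.
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str,
              max_output_tokens: int) -> str:
        """Return the pool label for this request."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_max_tokens else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Update the category ratio from usage.prompt_tokens feedback."""
        if prompt_tokens <= 0:
            return  # nothing to learn from an empty or missing count
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        old = self.ratios.get(category, self.default_ratio)
        # Exponential moving average of the bytes-per-token ratio.
        self.ratios[category] = (1 - self.ema_weight) * old \
            + self.ema_weight * observed
```

In use, a gateway would call route() before forwarding a request and observe() once the response's usage.prompt_tokens value arrives, so each category's ratio drifts toward what the serving tokenizer actually produces for that kind of text.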
