Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

TL;DR
This paper introduces dual-pool token-budget routing for LLM serving, which improves cost efficiency and reliability by sending each request to a short-context or a long-context pool according to its estimated token budget.
Contribution
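It proposes a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools, one high-throughput short-context pool and one high-capacity long-context pool, improving resource utilization and reducing cost. Token budgets are estimated from a learned bytes-to-token ratio rather than complex tokenization.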
Findings
Reduces GPU-hours by 31-42%, saving up to $2.86M annually.
Lowers preemption rates by 5.4× and improves P99 TTFT by 6%.
Projects $15.4M annual savings in large-scale deployments.
Abstract
Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet they are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from…
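The routing rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class name `TokenBudgetRouter`, the threshold `short_pool_limit`, the smoothing factor `alpha`, and the starting ratio `default_ratio` are all hypothetical. The abstract is truncated before it names the signal that feeds the moving average, so the sketch assumes the true prompt token count becomes available after a request has been served.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Illustrative dual-pool router; all defaults are assumed values."""

    short_pool_limit: int = 8192   # assumed: max token budget for the short pool
    alpha: float = 0.1             # assumed: EMA smoothing factor
    default_ratio: float = 4.0     # assumed: initial bytes per token
    ratios: dict = field(default_factory=dict)  # category -> bytes per token

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimate prompt tokens from byte length, then add the output allowance."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Return 'short' if the estimated budget fits the short pool, else 'long'."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, actual_prompt_tokens: int) -> None:
        """Update the category's bytes-per-token ratio with an EMA step.

        Assumption: the true prompt token count is known after serving.
        """
        if actual_prompt_tokens <= 0:
            return  # nothing to learn from an empty or invalid measurement
        observed = len(prompt.encode("utf-8")) / actual_prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed
```

A caller would invoke `route(...)` at dispatch time and `observe(...)` once the serving engine reports the real prompt length, so the per-category ratio tracks the traffic without a tokenizer on the routing path.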
