Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Xunzhuo Liu; Bowei He; Xue Liu; Andy Luo; Haichen Zhang; Huamin Chen

arXiv:2604.08075·cs.CL·April 10, 2026

TL;DR

This paper introduces dual-pool token-budget routing for LLM serving: each request is dispatched to a short-context or a long-context pool according to its estimated token budget, which reduces GPU-hours by 31-42% and lowers preemption rates by 5.4×.

Contribution

It proposes a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools, a high-throughput short-context pool and a high-capacity long-context pool, improving resource utilization and reducing costs without complex tokenization.

Findings

01. Reduces GPU-hours by 31-42%, saving up to $2.86M annually.

02. Lowers preemption rates by 5.4× and improves P99 TTFT by 6%.

03. Projects $15.4M annual savings in large-scale deployments.

Abstract

Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from…
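
The routing rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `TokenBudgetRouter`, the 8192-token short-pool limit, the smoothing factor 0.1, the initial ratio of 4 bytes per token, and the category labels are all hypothetical values chosen for illustration. The sketch also assumes that the total budget is the estimated prompt tokens plus the requested output tokens, and that the ratio is refreshed from token counts observed after a request has been served.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Illustrative two-pool dispatcher; all defaults are assumed values."""

    short_pool_limit: int = 8192   # hypothetical context limit of the short pool
    alpha: float = 0.1             # hypothetical EMA smoothing factor
    default_ratio: float = 4.0     # hypothetical starting bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # category -> bytes per token

    def estimate_budget(self, prompt: str, category: str, max_output_tokens: int) -> int:
        """Estimate total tokens without running a tokenizer on the prompt."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        est_prompt_tokens = math.ceil(prompt_bytes / ratio)
        # Assumption: total budget = estimated prompt tokens + output cap.
        return est_prompt_tokens + max_output_tokens

    def route(self, prompt: str, category: str, max_output_tokens: int) -> str:
        """Return "short" if the estimated budget fits the short pool, else "long"."""
        budget = self.estimate_budget(prompt, category, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt_bytes: int, actual_prompt_tokens: int) -> None:
        """Update the per-category ratio with an exponential moving average."""
        if actual_prompt_tokens <= 0:
            return  # nothing to learn from an empty or invalid observation
        observed = prompt_bytes / actual_prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed


router = TokenBudgetRouter()
short_decision = router.route("hello world", "chat", max_output_tokens=256)
long_decision = router.route("x" * 40000, "chat", max_output_tokens=256)
router.observe("code", prompt_bytes=3000, actual_prompt_tokens=1000)
```

With these assumed defaults, the 11-byte prompt is estimated at 3 prompt tokens and goes to the short pool, while the 40,000-byte prompt is estimated at 10,000 prompt tokens and goes to the long pool. Because the estimate only needs a byte count and a cached ratio, the dispatch decision stays cheap relative to tokenizing every request at the router.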
