ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Annette Taberner-Miller

TL;DR
ParetoBandit is an adaptive routing algorithm for multi-model LLM serving that enforces cost limits and adapts to shifts in pricing and quality, improving efficiency and robustness.
Contribution
It introduces an online primal-dual budget pacer and a geometric forgetting mechanism, which together give budget-aware model routing that adapts in non-stationary environments.
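To make the idea of geometric forgetting concrete, here is a minimal scalar sketch. It is an illustration under stated assumptions, not the paper's estimator: the class name `DiscountedMean` and the discount factor `gamma` are hypothetical, and a contextual bandit would more plausibly discount its regression statistics than a single running mean.

```python
class DiscountedMean:
    """Running mean in which an observation's weight decays geometrically with age."""

    def __init__(self, gamma: float = 0.9):
        # gamma is a hypothetical discount factor in (0, 1); smaller forgets faster.
        self.gamma = gamma
        self.weight = 0.0  # discounted count of observations
        self.total = 0.0   # discounted sum of rewards

    def update(self, reward: float) -> None:
        # Shrink the old statistics by gamma, then add the new observation at full weight.
        self.weight = self.gamma * self.weight + 1.0
        self.total = self.gamma * self.total + reward

    @property
    def value(self) -> float:
        # Return 0.0 before any observation to avoid dividing by zero.
        return self.total / self.weight if self.weight > 0 else 0.0


def track_quality_shift(before: float, after: float, steps: int = 200) -> float:
    """Feed `steps` rewards at one level, then `steps` at another; return the final estimate."""
    est = DiscountedMean(gamma=0.9)
    for _ in range(steps):
        est.update(before)
    for _ in range(steps):
        est.update(after)
    return est.value
```

Because old observations are down-weighted by `gamma` at every step, the estimate follows a quality regression instead of averaging it away, at the price of higher variance when nothing is changing.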
Findings
Maintains budget compliance within 0.4% on benchmark prompts.
Achieves up to +0.071 quality lift after shifts.
Integrates new models within approximately 142 steps.
Abstract
Multi-model LLM serving operates in a non-stationary, noisy environment: providers revise pricing, model quality can shift or regress without notice, and new models arrive regularly. More than a dozen recent methods have proposed learned routers to navigate the resulting quality-cost tradeoff across portfolios spanning a 530× cost range. Despite this activity, two gaps in the current solution space limit routing effectiveness under these conditions: no existing router enforces a dollar-denominated cost ceiling in closed loop over an open-ended request stream, and none provides principled online adaptation to post-deployment shifts in pricing or model quality. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that addresses both gaps. Its core contributions are: (1) an online primal-dual budget pacer that enforces a per-request…
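The closed-loop budget enforcement described in the abstract can be illustrated with a generic primal-dual pacing loop. The sketch below is a textbook-style illustration, not ParetoBandit's implementation: `BudgetPacer`, `budget_per_request`, `step_size`, and the two-model portfolio with fixed quality and cost values are all hypothetical, and a real router would replace the fixed qualities with bandit estimates.

```python
from dataclasses import dataclass


@dataclass
class BudgetPacer:
    budget_per_request: float  # hypothetical target mean cost per request
    step_size: float = 0.05    # hypothetical dual learning rate
    lam: float = 0.0           # dual variable: the current "price" of spending

    def choose(self, quality: list[float], cost: list[float]) -> int:
        # Primal step: pick the model with the best Lagrangian score, quality - lam * cost.
        scores = [q - self.lam * c for q, c in zip(quality, cost)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, spent: float) -> None:
        # Dual step: raise lam after overspending, lower it after underspending,
        # and project back onto lam >= 0.
        self.lam = max(0.0, self.lam + self.step_size * (spent - self.budget_per_request))


def simulate_mean_cost(steps: int = 3000) -> float:
    """Route `steps` requests over a toy two-model portfolio; return the mean cost paid."""
    quality = [0.6, 0.9]  # assumed fixed quality of a cheap and an expensive model
    cost = [1.0, 10.0]    # assumed per-request cost in arbitrary units
    pacer = BudgetPacer(budget_per_request=4.0)
    total = 0.0
    for _ in range(steps):
        arm = pacer.choose(quality, cost)
        total += cost[arm]
        pacer.update(cost[arm])
    return total / steps
```

In this toy setting the dual variable oscillates around the price at which the two models score equally, so the router alternates between them in a proportion that keeps average spend near the target without any fixed routing schedule.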
