ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Annette Taberner-Miller

arXiv:2604.00136·cs.LG·April 15, 2026

TL;DR

ParetoBandit is an adaptive routing algorithm for multi-model LLM serving that enforces cost limits and adapts to shifts in pricing and quality, improving efficiency and robustness.

Contribution

It introduces a novel online primal-dual budget pacer and geometric forgetting mechanism for effective, budget-aware, and adaptive model routing in non-stationary environments.
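To make the geometric forgetting idea concrete, here is a minimal sketch of an exponentially discounted quality estimate for a single model. It is a non-contextual simplification written for illustration, not the paper's implementation: the class name `ForgettingMean`, the discount `gamma`, and the reward values are all assumptions.

```python
class ForgettingMean:
    """Discounted running mean: old observations decay by gamma each step.

    Hypothetical illustration only; the paper applies forgetting inside a
    contextual bandit, which this sketch does not model.
    """

    def __init__(self, gamma: float = 0.95) -> None:
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must be in (0, 1]")
        self.gamma = gamma
        self.weight = 0.0  # discounted observation count
        self.total = 0.0   # discounted reward sum

    def update(self, reward: float) -> None:
        # Shrink the old statistics, then add the new observation.
        self.weight = self.gamma * self.weight + 1.0
        self.total = self.gamma * self.total + reward

    @property
    def mean(self) -> float:
        return self.total / self.weight if self.weight > 0 else 0.0


def track_shift(gamma: float = 0.95) -> float:
    """Feed a quality regression (0.8 then 0.3) and return the final estimate."""
    est = ForgettingMean(gamma)
    for _ in range(500):
        est.update(0.8)  # assumed quality before the shift
    for _ in range(500):
        est.update(0.3)  # assumed quality after the shift
    return est.mean
```

With `gamma = 0.95` the effective memory is roughly `1 / (1 - gamma) = 20` observations, so the estimate follows the post-shift quality, whereas an undiscounted mean (`gamma = 1.0`) would stay at the average of both regimes.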

Findings

01. Maintains budget compliance within 0.4% on benchmark prompts.

02. Achieves up to +0.071 quality lift after shifts.

03. Integrates new models within approximately 142 steps.

Abstract

Multi-model LLM serving operates in a non-stationary, noisy environment: providers revise pricing, model quality can shift or regress without notice, and new models arrive regularly. More than a dozen recent methods have proposed learned routers to navigate the resulting quality--cost tradeoff across portfolios spanning a $\sim 530\times$ cost range. Despite this activity, two gaps in the current solution space limit routing effectiveness under these conditions: no existing router enforces a dollar-denominated cost ceiling in closed loop over an open-ended request stream, and none provides principled online adaptation to post-deployment shifts in pricing or model quality. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that addresses both gaps. Its core contributions are: (1) an online primal--dual budget pacer that enforces a per-request…
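The primal-dual budget pacer named in the abstract can be illustrated with a generic projected dual-ascent loop. The sketch below is a textbook-style construction under stated assumptions, not the paper's algorithm: the class `BudgetPacer`, its parameters `target_cost` and `step_size`, and the quality and cost numbers in the usage example are hypothetical, and quality estimates are taken as given rather than learned by a bandit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BudgetPacer:
    """Generic primal-dual pacer (illustrative; not the paper's exact method)."""

    target_cost: float       # assumed average dollar budget per request
    step_size: float = 0.05  # assumed dual learning rate
    lam: float = 0.0         # dual variable: current "price" of spending

    def choose(self, quality: Sequence[float], cost: Sequence[float]) -> int:
        # Primal step: pick the model with the best cost-penalized quality.
        scores = [q - self.lam * c / self.target_cost
                  for q, c in zip(quality, cost)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, observed_cost: float) -> None:
        # Dual step: raise lam when overspending, lower it when underspending,
        # and project back onto lam >= 0.
        overspend = observed_cost / self.target_cost - 1.0
        self.lam = max(0.0, self.lam + self.step_size * overspend)


def average_spend(steps: int = 2000) -> float:
    """Route a stream between two hypothetical models; return mean cost."""
    quality = [0.9, 0.6]     # assumed quality estimates
    cost = [0.010, 0.001]    # assumed dollars per request
    pacer = BudgetPacer(target_cost=0.004)
    total = 0.0
    for _ in range(steps):
        k = pacer.choose(quality, cost)
        total += cost[k]
        pacer.update(cost[k])
    return total / steps
```

In this toy setting the dual variable oscillates around the value at which both models score equally, so the pacer alternates between them and the running average spend settles close to `target_cost`. The closed-loop behaviour comes from feeding each realized cost back into `update`.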
