TL;DR
This paper introduces StageRoute, an online algorithm for deploying and routing among streaming large language models, achieving near-optimal regret bounds and effective empirical performance under strict budget constraints.
Contribution
The paper presents a novel hierarchical online algorithm for joint deployment and routing of streaming LLMs, with proven near-optimal regret bounds and practical validation.
Findings
Achieves regret of order $T^{2/3}$ with a matching lower bound.
Effectively tracks a strong oracle under tight budgets.
Demonstrates strong empirical performance across diverse workloads.
Abstract
The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples stage-wise deployment (at fixed maintenance windows) with per-query routing among live models. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects models for the next stage, up to the concurrency cap, using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of order $T^{2/3}$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: StageRoute tracks a strong oracle under tight budgets across diverse workloads.
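As a rough illustration of step (i) in the abstract, the sketch below scores each candidate model with a reward upper-confidence bound and a cost lower-confidence bound, then keeps up to a cap of models. It is not the paper's procedure: the Hoeffding-style radius, the normalization of rewards and costs to [0, 1], the per-model affordability filter, the greedy ranking, and the names `confidence_bounds`, `select_deployment`, `m_max`, and `budget` are all assumptions made for this example.

```python
import math


def confidence_bounds(mean, pulls, t, lo=0.0, hi=1.0):
    """Return (lower, upper) Hoeffding-style bounds clipped to [lo, hi].

    Assumption: observations lie in [lo, hi]; the radius is a generic
    UCB-style choice, not the one analyzed in the paper.
    """
    if pulls == 0:
        return lo, hi  # untried model: widest possible interval
    radius = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    return max(lo, mean - radius), min(hi, mean + radius)


def select_deployment(stats, t, m_max, budget):
    """Greedy stand-in for the deployment step: pick up to m_max models.

    stats maps model name -> (mean_reward, mean_cost, pulls).
    A model is a candidate if its optimistic (lower-bound) cost fits the
    per-query budget; candidates are ranked by optimistic reward.
    """
    scored = []
    for name, (mean_r, mean_c, pulls) in stats.items():
        _, reward_ucb = confidence_bounds(mean_r, pulls, t)
        cost_lcb, _ = confidence_bounds(mean_c, pulls, t)
        if cost_lcb <= budget:
            scored.append((reward_ucb, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:m_max]]


if __name__ == "__main__":
    # Hypothetical statistics: (mean reward, mean cost, number of pulls).
    stats = {
        "a": (0.9, 0.2, 1000),
        "b": (0.5, 0.1, 1000),
        "c": (0.8, 0.95, 1000),  # accurate but too expensive
        "new": (0.0, 0.0, 0),    # newly arrived, never tried
    }
    print(select_deployment(stats, t=1000, m_max=2, budget=0.5))
```

Filtering each model against the budget on its own is a simplification: a routing policy that mixes models could still afford an expensive model part of the time, which this greedy rule ignores. The sketch does show why optimism matters in a streaming inventory, since a newly arrived model with no observations gets the widest interval and is therefore tried.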
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper looks at a very practical problem: how does one choose which models to deploy in a tight setup? This is very useful for practical settings.
2. I carefully checked several theorems and lemmas (not all); they look correct.
3. The experiments are quite comprehensive. In Figure 3, it is quite clear that cumulative regret is fairly low for StageRoute. The sensitivity analysis is quite nice too.
1. The paper does not do a great job of connecting with the existing explore-exploit literature. Please expand the section on why existing analyses do not apply. This bleeds into the experiments: why compare only with greedy and random? This to me is the biggest drawback of this paper; there is a ton of literature in the MAB space that could have been used to create stronger baselines.
2. In Eq. (3), it is important for the authors to describe the details of the problem. Is it a MIP? IP? LP? They b
1. The paper addresses a timely and practical challenge in LLM serving systems, where cost and scalability are key bottlenecks.
2. The theoretical analysis is rigorous and clarifies the trade-off between adaptivity and learnability.
3. The presentation and organization have improved since the NeurIPS version, with better discussion of parameter dependencies and clearer empirical exposition.
*Disclosure:* I also reviewed this paper in its earlier NeurIPS submission. Compared with that version, I find that the current paper has made several substantial improvements:
1. The authors now clearly explain why the dependence on $K$ becomes invalid when $K \ge O(T^{1/3})$, addressing my previous concern both theoretically (line 317) and empirically (line 466).
2. While the algorithmic design still combines several existing ideas, the paper now provides a more thoughtful discussion of how t
* Adapting LLM routing to an ever-changing landscape of models is a significant contribution.
* StageRoute includes a minimax optimality guarantee that also provides practical guidance on the selection of $K$ (the number of model deployment stages).
* The theoretical guarantees seem to require that $r_t$ and $c_t$ are drawn from model-dependent but *query*-independent distributions. It is unclear to me that this assumption makes sense in the context of studying *per-query* routing; in fact, if it does hold, why do any per-query routing at all?
* The empirical baseline comparisons seem lacking: the authors compare only to greedy and random baselines, not to any prior routing methods.
