Routing, Cascades, and User Choice for LLMs

Rafid Mahmood

arXiv:2602.09902·cs.GT·February 11, 2026

Open Access · 3 Reviews

TL;DR

This paper models the interaction between LLM providers and users, analyzing how routing and cascading decisions impact user utility and provider costs, revealing conditions for optimal policies and potential misalignments.

Contribution

It introduces a game-theoretic framework for LLM routing and user behavior, providing insights into optimal policies and misalignment issues.

Findings

01

Optimal routing often involves static policies without cascading.

02

Misalignment occurs when user and provider utility rankings differ.

03

Throttling can reduce costs but also depress user utility.

Abstract

To mitigate the trade-offs between performance and costs, LLM providers route user tasks to different models based on task difficulty and latency. We study the effect of LLM routing with respect to user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon tasks if the routed model cannot solve them. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user best response and simplifying the provider problem. We observe that in nearly all cases, the optimal routing policy involves a static policy with no cascading that depends on the expected utility of the models to the user. Furthermore, we reveal a misalignment gap between the provider-optimal and…
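The provider-versus-user tension described in the abstract can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the paper's model: the `Model` fields, the numbers, the single routing probability `p`, and the rule that the provider only compares policies under which the user participates are all hypothetical. It assumes a static policy with no cascading and a user who re-prompts until the task is solved whenever the expected per-call gain is positive.

```python
from dataclasses import dataclass


@dataclass
class Model:
    solve_prob: float  # chance one call solves the task (assumed value)
    delay: float       # latency penalty the user pays per call (assumed value)
    cost: float        # provider cost per call (assumed value)


def evaluate(p, standard, reasoning, value):
    """Return (user_utility, provider_cost) for a static policy that sends
    each prompt to the reasoning model with probability p."""
    q = p * reasoning.solve_prob + (1 - p) * standard.solve_prob
    d = p * reasoning.delay + (1 - p) * standard.delay
    c = p * reasoning.cost + (1 - p) * standard.cost
    if q * value - d <= 0:
        return 0.0, 0.0  # the user abandons immediately, so nothing is served
    rounds = 1.0 / q  # expected calls until success (geometric distribution)
    return value - d * rounds, c * rounds


# Hypothetical parameters: the reasoning model is more accurate, slower,
# and much more expensive for the provider to serve.
standard = Model(solve_prob=0.5, delay=1.0, cost=1.0)
reasoning = Model(solve_prob=0.9, delay=1.5, cost=4.0)
value = 10.0

grid = [i / 10 for i in range(11)]
results = {p: evaluate(p, standard, reasoning, value) for p in grid}

# The user prefers the policy with the highest utility; the provider prefers
# the cheapest policy among those where the user still participates.
user_best = max(grid, key=lambda p: results[p][0])
provider_best = min(
    (p for p in grid if results[p][0] > 0), key=lambda p: results[p][1]
)
```

With these made-up numbers the user's preferred policy always routes to the reasoning model while the provider's preferred policy never does, which is one simple way a gap between user-optimal and provider-optimal routing can arise.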

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. The high-level problem of learning optimal routing strategies is quite well motivated. 2. The formalism is thorough, though it is quite dense at some points.

Weaknesses

1. In general, I have quite a few concerns about how realistic the whole setup is. The tasks that the paper defines seem a bit different from what LLM users encounter in practice. The paper should describe (i) Why is the monetary cost to the user not modeled? (ii) The user churning with some predefined probability seems plausible. But during deployment, users mostly cannot check the accuracy on each individual example. So is the framework here meant to be more suitable for scenarios when user

Reviewer 02 · Rating 8 · Confidence 2

Strengths

I very much enjoyed reading this paper! The authors' formalization of the problem of user and model deployer interaction as it relates to what model is served is highly (and increasingly) important to today's consumer AI dynamic. I found the connection to game theory quite creative, and I learned quite a bit from reading the paper. The authors' novel insights about possible gamification that model deployers could engage in, given such a game (re: throttling latency) are likely quite a valuable

Weaknesses

I found quite few weaknesses in the work; however, I may have missed something in the mathematics. As someone a bit weaker on the theory-side, I did find some of the theoretical discourse quite dense and a little convoluted -- especially section 3 (but this may be my own naivete --- indicated in my lower confidence score). As noted above, the authors seem quite upfront on their limitations (I would be interested in settings where the user may not be aware of the routing policy or s). More mi

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Originality: The paper introduces a novel behavioral–economic perspective on LLM routing. While prior routing work (e.g., Chen et al., 2023; Ding et al., 2024; Hu et al., 2024) focuses on minimizing cost–latency trade-offs, this paper uniquely models strategic user response via a multi-round prompting game (Section 3). This Stackelberg formulation, with users as rational agents, represents a conceptual advance that bridges operations research and AI system design. Quality: The analysis is mathem

Weaknesses

Empirical validation: The work is entirely theoretical. While this is appropriate for a conceptual contribution, the claims about user patience and latency manipulation (Figure 5 Right) would benefit from empirical support, e.g., simulations or user–provider experiments. Limited model diversity: The analysis considers only two models (standard vs. reasoning). While the authors acknowledge this in the conclusion (Section 7), the extension to $n$ models could meaningfully affect equilibrium behav

Taxonomy

Topics: Peer-to-Peer Network Technologies · Network Traffic and Congestion Control · Caching and Content Delivery