Universal Model Routing for Efficient LLM Inference

Wittawat Jitkrittum; Harikrishna Narasimhan; Ankit Singh Rawat; Jeevesh Juneja; Congchao Wang; Zifeng Wang; Alec Go; Chen-Yu Lee; Pradeep Shenoy; Rina Panigrahy; Aditya Krishna Menon; Sanjiv Kumar

arXiv:2502.08773·cs.CL·July 23, 2025

TL;DR

UniRoute introduces a dynamic model routing method for large language models that efficiently handles unseen models at test time by representing LLMs as feature vectors and using clustering techniques.

Contribution

The paper presents UniRoute, a novel approach for dynamic LLM routing that generalizes to unseen models using feature-based representations and clustering, with theoretical guarantees.

Findings

01 · Effective routing among 30+ unseen LLMs

02 · Outperforms fixed pool routing methods

03 · Theoretically grounded with excess risk bounds

Abstract

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the…
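The cluster-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, the precomputed centroids (assumed to come from something like k-means over validation prompts), and the scoring rule "per-cluster error plus `lam` times cost" are all assumptions made for illustration. Each LLM is summarized by its mean error in every prompt cluster, so a model that appears only at test time can be added by evaluating it on the validation prompts, with no router retraining.

```python
from typing import Dict, List, Sequence


def nearest_cluster(x: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to x in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))


def llm_feature_vector(
    val_embeddings: List[Sequence[float]],
    val_errors: List[float],
    centroids: List[Sequence[float]],
) -> List[float]:
    """Per-cluster mean error of one LLM on the validation prompts."""
    k = len(centroids)
    sums = [0.0] * k
    counts = [0] * k
    for x, err in zip(val_embeddings, val_errors):
        c = nearest_cluster(x, centroids)
        sums[c] += err
        counts[c] += 1
    # Assumption: an empty cluster falls back to the LLM's overall mean error.
    overall = sum(val_errors) / len(val_errors)
    return [sums[c] / counts[c] if counts[c] else overall for c in range(k)]


def route(
    x: Sequence[float],
    centroids: List[Sequence[float]],
    features: Dict[str, List[float]],
    costs: Dict[str, float],
    lam: float,
) -> str:
    """Pick the LLM minimizing (estimated error in x's cluster + lam * cost)."""
    c = nearest_cluster(x, centroids)
    return min(features, key=lambda m: features[m][c] + lam * costs[m])


# Toy setup (hypothetical data): two prompt clusters, four validation prompts.
centroids = [(0.0, 0.0), (10.0, 10.0)]
val_embeddings = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]

# 0/1 errors of each LLM on the four validation prompts.
features = {
    "small": llm_feature_vector(val_embeddings, [0, 0, 1, 1], centroids),
    "large": llm_feature_vector(val_embeddings, [0, 0, 0, 1], centroids),
}
costs = {"small": 1.0, "large": 10.0}

easy_prompt, hard_prompt = (0.5, 0.5), (9.0, 9.0)
choice_easy = route(easy_prompt, centroids, features, costs, lam=0.01)
choice_hard = route(hard_prompt, centroids, features, costs, lam=0.01)

# A previously unseen LLM is added by computing its feature vector only.
features["new"] = llm_feature_vector(val_embeddings, [0, 0, 0, 0], centroids)
costs["new"] = 5.0
choice_hard_after = route(hard_prompt, centroids, features, costs, lam=0.01)
```

In this toy run the easy prompt goes to `small`, the hard prompt goes to `large`, and once `new` is registered the hard prompt goes to `new`. Raising `lam` shifts choices toward cheaper models.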

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The set of evaluations is very broad and shows consistent gains against reasonable baselines
2. The test-time approach is amenable to many real use cases
3. The method is cognisant of cost, as opposed to naively just optimising for performance

Weaknesses

1. The set of metrics is relatively narrow; it would be good to cover generation metrics as some other methods do
2. It is unclear how robust this is to shifts in distribution compared to the validation set

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The paper addresses a realistic and under-explored setting. Most prior work focuses on a fixed pool of LLMs and requires retraining when a new model is added; UniRoute is explicitly designed to handle a dynamic pool of LLMs.
- To my knowledge, the solution proposed by the authors is novel and quite simple. I think that the cluster-based example setups are practical and easy to implement. It is also cost-effective to adapt to new models.
- The paper gives an explicit optimal routing rule that s

Weaknesses

- The given clustering-based examples rely on the representativeness of the validation dataset. While this works, the paper gives limited insight into sensitivity. What happens if clusters are not well aligned with the task structure?
- The proposed method assumes that you have access to a validation dataset with labels. How could this be generalized to noisy labels for a validation dataset or a validation dataset without labels? (I think that an intuitive answer would be enough instead of runn

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Relevant problem: tackles a practical, current task where LLM/agent methods are still brittle; the problem choice is well motivated.
- The components fit together: the modeling choices are consistent with the objective, and the pipeline is implementable without exotic assumptions.
- Across multiple datasets/settings, the method shows consistent improvements over the stated baselines, not just one cherry-picked case.
- There is at least some attempt at digging into why it works (error breakdown

Weaknesses

- Ablations are thin: key modules are turned on/off only in one setting; we don’t see if the effect is stable across datasets/scales. A 2–3 row ablation table per main component would already help.
- Limited robustness reporting: no real stress test (distribution shift, noisier inputs, or lower-resource regime). Right now, the method looks tuned to a friendly setup.
- Clarity on compute/cost: method adds some overhead but the paper doesn’t quantify it clearly; for adoption, people will want to k

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Power Systems and Technologies

Methods: Sparse Evolutionary Training · Focus