RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models
Sai Hao, Hao Zeng, Hongxin Wei, Bingyi Jing

TL;DR
RACER is a novel routing method for large language models that minimizes misrouting risk by outputting calibrated model sets, improving accuracy and cost-efficiency in multi-model systems.
Contribution
It formulates LLM routing as the $\alpha$-VOR problem and introduces RACER, a risk-aware, calibrated approach that constructs nested model sets with theoretical risk control guarantees.
Findings
RACER achieves rigorous distribution-free risk control.
It consistently improves downstream accuracy across benchmarks.
The method effectively balances model set size and misrouting risk.
Abstract
Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem, which minimizes the expected set size while controlling the misrouting risk, and propose RACER, a novel method that extends base routers to output model sets whose responses can subsequently be aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and uses finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments…
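To make the idea of calibrating a threshold over nested model sets more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it swaps in a generic recipe (a Hoeffding upper bound combined with fixed-sequence testing over a threshold grid), and every name in it (`model_set`, `misrouted`, `empirical_risk`, `calibrate_threshold`, `alpha`, `delta`, `grid`) is hypothetical. It also assumes that a query counts as misrouted when no model in its set answers correctly, that an empty set counts as a miss, and that calibration queries are i.i.d. The summary above does not specify how RACER defines its scores, its loss, or its treatment of abstention.

```python
import math

def model_set(scores, t):
    """Indices of models whose router score is at least t.
    Lowering t can only add models, so the sets are nested in t."""
    return [m for m, s in enumerate(scores) if s >= t]

def misrouted(scores, correct, t):
    """Assumed loss: True if no model in the set answers correctly.
    An empty set is counted as a miss in this sketch."""
    return not any(correct[m] for m in model_set(scores, t))

def empirical_risk(all_scores, all_correct, t):
    """Fraction of calibration queries that are misrouted at threshold t."""
    misses = sum(misrouted(s, c, t) for s, c in zip(all_scores, all_correct))
    return misses / len(all_scores)

def calibrate_threshold(all_scores, all_correct, alpha, delta, grid):
    """Walk the grid from the lowest threshold (largest sets) upward and
    keep the last threshold whose Hoeffding upper bound on the risk stays
    at or below alpha. Returns None if even the largest sets fail."""
    n = len(all_scores)
    # Hoeffding slack for a loss bounded in [0, 1] at confidence 1 - delta.
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    chosen = None
    for t in sorted(grid):
        if empirical_risk(all_scores, all_correct, t) + slack <= alpha:
            chosen = t  # certified so far; try a stricter threshold
        else:
            break       # stop at the first failure (fixed-sequence testing)
    return chosen
```

A caller would compute router scores and per-model correctness on a held-out calibration set, run `calibrate_threshold` once, and then use `model_set(scores, t)` with the returned threshold at test time. A looser `alpha` permits a higher threshold and therefore smaller sets, which is the trade-off between set size and misrouting risk described above.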
Taxonomy
Topics: Software-Defined Networks and 5G · Natural Language Processing Techniques · Advanced Neural Network Applications
