Expected Reward Prediction, with Applications to Model Routing

Kenan Hasanaliyev; Silas Alberti; Jenny Hamer; Dheeraj Rajagopal; Kevin Robinson; Jasper Snoek; Victor Veitch; Alexander Nicholas D'Amour

arXiv:2603.20217·cs.CL·March 24, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a method to predict the expected reward of language models for prompts before response generation, enabling efficient model routing to maximize reward and reduce computational costs.

Contribution

It presents a novel approach to predict expected rewards of LLMs for prompts, facilitating effective model routing at inference time.

Findings

01

Expected reward predictions are accurate and discriminative.

02

The routing method outperforms baseline strategies.

03

The approach is easily extensible with new models.

Abstract

Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model's suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed…
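The two quantities the abstract describes can be sketched in a few lines: the expected reward of a model on a prompt, estimated as the mean reward-model score over repeated samples, and a routing rule that picks the model with the best predicted expected reward once cost is accounted for. This is a minimal illustration and not the paper's implementation. The function names, the model names, the numeric values, and the linear cost penalty (`cost_weight`) are all assumptions made for the example; the abstract does not specify how cost is controlled.

```python
from statistics import mean


def monte_carlo_expected_reward(sampled_rewards):
    """Estimate expected reward as the mean reward-model score over
    repeated samples from one model on one prompt. A predictor could be
    trained to regress this value from the prompt alone (assumption)."""
    return mean(sampled_rewards)


def route(predicted_rewards, costs, cost_weight=0.0):
    """Return the model name that maximizes
    predicted expected reward - cost_weight * cost.

    predicted_rewards: dict mapping model name -> predicted expected reward
    costs:             dict mapping model name -> relative inference cost
    cost_weight:       hypothetical trade-off knob; 0.0 ignores cost
    """
    return max(
        predicted_rewards,
        key=lambda name: predicted_rewards[name] - cost_weight * costs[name],
    )


# Hypothetical predictions for a single prompt (illustrative numbers only).
predicted = {"small-model": 0.62, "large-model": 0.70}
costs = {"small-model": 1.0, "large-model": 8.0}

reward_only_choice = route(predicted, costs)                    # cost ignored
cost_aware_choice = route(predicted, costs, cost_weight=0.05)   # cost penalized
```

With these illustrative numbers, ignoring cost selects the model with the higher predicted reward, while a small cost penalty flips the choice to the cheaper model. Because routing needs only per-model predictions for the prompt, no response has to be generated before the decision is made.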

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The technical contents are clearly written, and the paper is generally easy to follow.
- The experiment scale is great, covering a good number of LLMs, reward models, and routing baselines.
- The empirical observations that the expected reward is readily predictable, and that the rewards (using the same reward model) are comparable across tasks and across LLMs, are quite valuable for the research community.
- The proposed method requires minimum computation *during test time*.

Weaknesses

- Paper clarity: It would make sense to move more routing results with comparisons against the baselines to the main text of the paper. Currently there is only one figure with one reward model in Figure 1(b).
- Discussions on limitations and future works are quite thin.
- The non-trivial train-time computational costs of the proposed method for calculating expected reward are not adequately discussed. It seems to me that the predictability of the expected reward comes from the availability of a

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper is well written and generally clear.
- It’s great to see that a relatively small model is able to accurately predict expected rewards.
- The simplicity of the approach and some interesting findings such as that there is enough variability between models.

Weaknesses

- Motivation: the paper doesn’t convincingly motivate why it is advantageous to predict the expected reward instead of the actual RM rewards.
- Practical applicability/downstream performance: I was missing experiments that showed how this routing setup would actually improve post-training of language models on downstream evaluations. All evaluations in the paper are intrinsic evaluations, but it would be interesting to see actual applicability to concrete tasks.
- Results: Some of the metrics in

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This paper finds an unexpected phenomenon, and has well-designed experiments to critically analyze this phenomenon. The fact that rewards are reasonably predictable for given models has interesting implications, both for model routing as shown in this paper, and other potential applications (as they mention), such as test-time modifications to system prompts/etc, and has potential implications for building simpler or more efficient classical RMs (e.g. distilling a RM into a smaller classifier, a

Weaknesses

While the experiments make sense to me, I'd also be interested to see how well this works for creating a reasonable ensemble of models, and how they perform on different benchmarks. For example, we could see what the Alpaca Eval score is when routing prompts to different models, or how this setup performs on benchmarks targeting math, or instruction following. I don't think those would be required for this paper, but they would be 1) very interesting, and 2) would show the further applicability
