Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun

TL;DR
This paper questions the effectiveness of current reward model evaluation methods, revealing that accuracy alone poorly predicts downstream policy performance and can be misleading due to overoptimization issues.
Contribution
It demonstrates through synthetic experiments that RM accuracy does not reliably indicate policy success and highlights the limitations of current evaluation practices.
Findings
Weak correlation between RM accuracy and policy performance
Measurement method significantly affects accuracy's predictive power
Accuracy can fail to detect RM overoptimization risks
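The first finding is a statement about rank agreement: does ordering RMs by accuracy also order them by the quality of the policies they produce? A minimal sketch of that check, using Spearman rank correlation in plain Python, is given here. The function names and the numbers in the usage lines are illustrative assumptions, not the paper's code or data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers for illustration only: one entry per reward model.
rm_accuracy = [0.62, 0.65, 0.65, 0.70]
policy_score = [0.40, 0.55, 0.35, 0.50]
print(spearman(rm_accuracy, policy_score))
```

A value near 1 would mean accuracy ranks RMs the same way downstream performance does; the paper reports that the observed relationship is positive but weak.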
Abstract
Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of…
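For readers unfamiliar with the metric under discussion, the following is a minimal sketch of RM accuracy on preference data: the fraction of annotated pairs in which the RM scores the preferred response above the rejected one. The `(prompt, chosen, rejected)` tuple layout, the `reward_fn` signature, and the length-based toy reward are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

# Assumed layout of one annotated example: (prompt, chosen, rejected).
PreferencePair = Tuple[str, str, str]


def pairwise_accuracy(
    reward_fn: Callable[[str, str], float],
    pairs: Iterable[PreferencePair],
) -> float:
    """Fraction of pairs where the chosen response gets the higher reward."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(pairs)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained RM: longer responses score higher."""
    return float(len(response))


validation_pairs = [
    ("q1", "a detailed answer", "short"),  # toy reward agrees with the label
    ("q2", "ok", "a long but wrong answer"),  # toy reward disagrees
]
print(pairwise_accuracy(toy_reward, validation_pairs))  # 0.5
```

Note that this metric only counts how often the RM orders a pair correctly. It ignores the size of the reward gap, which is one reason two RMs with similar accuracy can steer policy optimization differently.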
Peer Reviews
Decision·ICLR 2025 Spotlight
Overall, the paper presents empirical results that evaluate the metrics of RMs, which the community has previously intuited but not with rigorous scientific evaluation. The experiments are designed to test hypotheses incrementally, helping the RLHF community build a comprehensive body of knowledge on RM evaluation. - The paper tackles the question that the alignment research community needs to know the answer to. - Table 2 is interesting as it shows counterintuitive results. One would guess tha
I don't see any critical weaknesses for the paper. If I were to come up with the weaknesses: - Although policy regret is often referred to in the paper, its formal definition is not clearly stated. It would be better to have an equation defining the regret. Even if we do not have a way to compute it, the goal of the research is to estimate it so I would say that it is worth clarifying its definition formally. - The scope of the paper is to show that the current evaluation scheme is not enough (
The problem tackled is well introduced and presented. Contributions are clear. Having a synthetic and cheap setting that correlates with more realistic experiments is quite valuable for the research community (but see doubts expressed below). Some very interesting findings in this setting: * weak correlation between RM accuracy and downstream policy performance/regret * correlation increases with the number of answers per prompt * correlation increases when picking answers based on their
The key limitation from my perspective is that we currently have no solid evidence that the study will translate to real settings: diverse, heterogeneous RMs potentially trained on different datasets. While it would be infeasible to run as many experiments and ablations as in the synthetic setting, showing that there is imperfect correlation between RewardBench scores and relevant, downstream RLHF settings (more generally, that some findings from the synthetic study do replicate) would make the
* Understanding the relationship between our measures of reward quality and policy quality is of vital importance. * The goal and motivation for the work are well laid out at the start of the paper. The takeaways the reader should expect to have are presented from the start. * The use of a synthetic ground truth reward function is well motivated and contextualized. * Different methods for using a ground truth reward function (best-of-n and RL) are compared.
**High level** The main weaknesses of this paper are not overly large and mostly involve clarifications to the text. While not big changes, they are important to address. Some of the conclusions in the main body need to be walked back and made more nuanced to fully reflect the presented results. The biggest missing result is information about the relationship between the ground truth reward model and the proxy reward model. **Details** * The experiments are set up such by taking a labelled p
Taxonomy
Topics: Health Systems, Economic Evaluations, Quality of Life
Methods: Sparse Evolutionary Training
