How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

TL;DR
This paper introduces PPE, a benchmark for evaluating reward models in RLHF by using proxy tasks to predict downstream language model performance, reducing the need for costly full RLHF training.
Contribution
The paper presents PPE, the first reward model benchmark explicitly linked to real-world human preference outcomes, and develops proxy tasks to efficiently evaluate reward models.
Findings
Proxy tasks correlate with downstream RLHF performance.
PPE provides a practical benchmark for reward model evaluation.
Open-source code enables community use and development.
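The first finding can be illustrated with a minimal sketch: compute one proxy metric per reward model (here, pairwise preference accuracy) and correlate it with a downstream post-RLHF score. This is not PPE's code. The function names, the three reward models, and all numbers below are hypothetical toy values chosen only to show the computation.

```python
from math import sqrt


def pairwise_accuracy(scores_chosen, scores_rejected):
    """Fraction of preference pairs in which the reward model scores the
    human-preferred response strictly above the rejected one."""
    pairs = list(zip(scores_chosen, scores_rejected))
    return sum(c > r for c, r in pairs) / len(pairs)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical reward-model scores on four preference pairs:
# (scores for the chosen responses, scores for the rejected responses).
reward_model_scores = {
    "rm_a": ([0.9, 0.2, 0.7, 0.4], [0.1, 0.5, 0.3, 0.6]),
    "rm_b": ([0.8, 0.6, 0.9, 0.3], [0.2, 0.1, 0.4, 0.5]),
    "rm_c": ([0.7, 0.9, 0.8, 0.6], [0.3, 0.2, 0.1, 0.4]),
}

# Hypothetical downstream scores of the LLM trained with each reward model.
downstream_score = {"rm_a": 1000.0, "rm_b": 1040.0, "rm_c": 1090.0}

names = sorted(reward_model_scores)
proxy = [pairwise_accuracy(*reward_model_scores[name]) for name in names]
target = [downstream_score[name] for name in names]

correlation = pearson(proxy, target)
print(dict(zip(names, proxy)))
print(f"Pearson correlation with downstream score: {correlation:.3f}")
```

With real data, the proxy list would hold one metric value per reward model and the target list would hold the measured post-RLHF outcome for the same models. A high correlation suggests the cheap metric can stand in for the expensive training run.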
Abstract
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference and a verifiable correctness preference dataset, in which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our…
Peer Reviews
Decision·ICLR 2025 Poster
- This paper addresses an important topic: the effectiveness of a reward model should ultimately be assessed by the performance of LLMs after RLHF. The study offers valuable insights into the evaluation of existing reward models.
- The paper includes extensive experiments and evaluates models using real-world human preferences.
- The study employs DPO as the RLHF algorithm to evaluate post-RLHF performance. However, the offline DPO algorithm may face generalization issues. Utilizing an online PPO algorithm to obtain post-RLHF models and assess their performance would offer a more comprehensive evaluation of various reward models. This additional experimentation is essential to support the authors’ core claim of being “the first reward model benchmark explicitly linked to post-RLHF performance.”
- The paper utilizes Lla
- Achieves 77% Pearson correlation with human preference ELO.
- Well motivates why best-of-K scores, row-wise Pearson correlation, & accuracy are relevant for downstream RLHF success.
- Human preference dataset is large and useful in producing statistically significant outcomes (16,038 labeled preference pairs).
- Underdiscusses the possibility that the RM evals are computationally very expensive to run in the PPE framework. Best-of-K performance curves and pairwise accuracy have a huge computational burden, especially as K increases. Please provide runtime estimates for different K values or discuss strategies for making the evaluations more computationally efficient.
- Does not discuss the findings (e.g. low quantile aggregation correlation) in any depth. Please explore potential explanations fo
Strengths:
- The problem is highly relevant: the evaluation of reward models and RLHF workflows is a challenging problem.
- The collection of human preferences and its open-source release is a valuable contribution to the community. The amount of collected human data is impressive (especially including multi-lingual queries).
- The approach of grounding metrics with human data is highly appreciated, and can be a valuable tool for future reward model development.
- The selection of models, prompts an
Weaknesses:
- Some of the results seem pretty surprising (e.g. the low correlation with RewardBench), and it is a bit unclear to me why we observe them. It would be great if the authors could try to reason more about these observations (as outlined below).
- Clarity of writing and presentation could be improved.
- "Reward Model Accuracy" as the best metric is not a novel contribution in itself, though you may argue that it is good to have it validated against human preferences.
- Overall, I find the findings a
Taxonomy
Topics: Risk and Safety Analysis
