RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces RMB, a comprehensive benchmark for reward models in LLM alignment, covering diverse real-world scenarios to better evaluate and improve reward model effectiveness.
Contribution
The paper presents RMB, a new benchmark covering 49 scenarios, with evaluation methods that better reflect reward models' alignment performance and reveal generalization issues.
Findings
Positive correlation between RMB scores and downstream alignment performance.
Revealed generalization defects in state-of-the-art reward models.
Analyzed the factors that affect generative reward models.
Abstract
Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans. Evaluating RMs is the key to better aligning LLMs. However, the current evaluation of RMs may not directly correspond to their alignment performance due to the limited distribution of evaluation data and evaluation methods that are not closely related to alignment objectives. To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and the downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not…
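The two evaluation modes named in the abstract can be illustrated with a minimal sketch. It assumes that pairwise accuracy counts a pair as correct when the reward model scores the preferred response strictly higher than the rejected one, and that Best-of-N accuracy counts a list as correct only when the designated winner outscores every other response. The function names and the reward scores below are hypothetical, and the paper's exact metric definitions may differ.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    Assumption: a pair is correct when the chosen response receives a
    strictly higher reward than the rejected one.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


def bon_accuracy(lists: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Fraction of (winner_score, loser_scores) lists ranked correctly.

    Assumption: a list is correct only when the winner's reward is
    strictly higher than the reward of every loser in that list.
    """
    if not lists:
        raise ValueError("need at least one list")
    correct = sum(
        1 for winner, losers in lists if all(winner > loser for loser in losers)
    )
    return correct / len(lists)


# Hypothetical reward scores, made up for illustration only.
pairs = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.3)]
bon_lists = [(0.9, [0.2, 0.5, 0.8]), (0.6, [0.7, 0.1])]

pairwise = pairwise_accuracy(pairs)  # 3 of 4 pairs are ranked correctly
bon = bon_accuracy(bon_lists)        # 1 of 2 lists has the winner on top
print(pairwise, bon)
```

Under these assumptions the Best-of-N check is stricter than the pairwise one: a single loser that outscores the winner makes the whole list count as wrong.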
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper is written well and is easy to understand. 2. The studied problem is significant. 3. The results seem to outperform the SOTA datasets of the reward model evaluation.
1. The paper currently includes some discussion related to benchmark comparisons in Section 5.2, particularly with RewardBench. However, a more explicit comparison of the features and approaches of existing benchmarks early in the paper would better highlight the novelty of this work. Rather than relying on experimental results to convey superior performance, detailing how our model’s capabilities differ from those of previous benchmarks would strengthen the paper's contribution. 2. In the conc
RMB presents many strengths, especially against its main predecessor, RewardBench. Primarily, the dataset size is much larger than any similar datasets used to evaluate reward models. The paper does a thorough investigation of best-of-N evaluation, which related papers had proposed as a future direction of research. The authors do a good job of categorization and subcategorization, to ensure broad coverage across different tasks. The benchmark shows good progress towards a useful proxy for downstr
There are a couple of weaknesses in the paper. The human annotator sample size is small relative to the size of the dataset. The correlation analysis focuses mainly on the best-of-N sampling rather than RLHF, which is acknowledged in the paper as a limitation. While mentioning that it is hard for a model to be both competitive in judging helpfulness and harmlessness, the authors don’t deeply explore the trade-offs between the two metrics. The authors used various models in the categorization and
- The paper is well-structured and clearly written, explaining the methodology and results effectively. - RMB effectively addresses the limitations of developing reward models that align with the objective, whether helpfulness or harmlessness. Correlation evaluations show that the reward model's performance on RMB can reflect the performance of the downstream aligned model more accurately. This would enable researchers to evaluate and iterate on reward models more efficiently before training a m
- The reliance on LLM-generated responses in the dataset curation may pose potential long-term limitations. As reward models are often used to align newer LLMs, using responses from current LLMs to evaluate them could create a circular dependency. In other words, this benchmark may not be able to differentiate reward models that could guide LLMs that are beyond current capabilities. In the next iterations, it would be nice to incorporate human-generated responses to ensure more diverse sources of
Taxonomy
Topics: Artificial Intelligence in Law · Mathematics, Computing, and Information Processing · Library Science and Information Systems
