RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li

TL;DR
RM-Bench is a new benchmark that evaluates reward models based on their ability to detect subtle content differences and resist style biases, showing current models need significant improvement for better alignment.
Contribution
The paper introduces RM-Bench, a novel benchmark that assesses reward models on their sensitivity to subtle content differences and their resistance to style biases, and whose scores correlate well with policy model performance.
Findings
Nearly 40 reward models were evaluated; even state-of-the-art models average only 46.6% under style-bias interference, below the 50% random baseline.
Current models struggle with style bias, performing near random chance.
RM-Bench correlates strongly with policy model success.
Abstract
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language…
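As a rough illustration of the kind of evaluation the abstract describes, the sketch below computes pairwise accuracy: the fraction of (prompt, chosen, rejected) triples for which a reward model scores the better response higher. It is not the RM-Bench implementation, and the benchmark's actual metric may differ. The names `pairwise_accuracy`, `length_biased_score`, and the two example pairs are hypothetical, and the toy scorer stands in for a reward model that prefers longer responses regardless of correctness.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical record layout: (prompt, chosen response, rejected response).
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float],
                      pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def length_biased_score(prompt: str, response: str) -> float:
    """Toy stand-in for a style-biased reward model: longer is always better."""
    return float(len(response))


# Illustrative data only: one pair where the correct answer is terse,
# one where the correct answer happens to be the longer response.
pairs = [
    ("What is 2 + 2?", "4", "The answer to this question is 5."),
    ("Capital of France?", "The capital of France is Paris.", "Lyon"),
]

accuracy = pairwise_accuracy(length_biased_score, pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

A scorer driven purely by length gets the first pair wrong and the second right, so it lands at chance level on this toy set, which is the failure mode a style-controlled benchmark is meant to expose.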
Code & Models
- 🤗 internlm/internlm-xcomposer2d5-7b-reward (model, 225 downloads, 11 likes)
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-GenRM (model, 122 downloads, 18 likes)
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-GenRM-Multilingual (model, 49 downloads, 6 likes)
- 🤗 nvidia/Llama-3.3-Nemotron-70B-Reward (model, 59 downloads, 3 likes)
- 🤗 nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual (model, 38 downloads, 10 likes)
- 🤗 nvidia/Qwen-2.5-Nemotron-32B-Reward (model, 18 downloads, 2 likes)
- 🤗 nvidia/Qwen-3-Nemotron-32B-Reward (model, 198 downloads, 19 likes)
- 🤗 Bifrost-AI/Qwen-3-Nemotron-32B-Reward-F16 (model, 2 downloads)
- 🤗 nvidia/Llama-3.3-Nemotron-70B-Reward-Principle (model, 278 downloads, 6 likes)
- 🤗 nvidia/Qwen3-Nemotron-32B-GenRM-Principle (model, 738 downloads, 14 likes)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN
