RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu; Zijun Yao; Rui Min; Yixin Cao; Lei Hou; Juanzi Li

arXiv:2410.16184·cs.CL·October 22, 2024

TL;DR

RM-Bench is a new benchmark that evaluates reward models based on their ability to detect subtle content differences and resist style biases, showing current models need significant improvement for better alignment.

Contribution

The paper introduces RM-Bench, a benchmark that assesses reward models on two axes: sensitivity to subtle content differences and resistance to style biases. Its scores correlate strongly with downstream policy model performance.

Findings

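1. Nearly 40 reward models were evaluated; even state-of-the-art models average only 46.6% under style-bias interference, below the 50% random baseline.

2. Current reward models struggle with style bias, performing at or near random chance.

3. RM-Bench scores correlate strongly with policy model performance.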

Abstract

Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language…
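To make the evaluation idea concrete, here is a minimal sketch of pairwise accuracy for a reward model, assuming each test item is a prompt with one preferred ("chosen") and one dispreferred ("rejected") response. The schema, the function names, and the toy length-based scorer are all hypothetical illustrations, not the RM-Bench data format or its official metric. The toy scorer rewards longer responses regardless of correctness, which is one simple way a style bias can appear.

```python
from typing import Callable, Dict, List

# Hypothetical schema: each pair has the keys "prompt", "chosen", "rejected".
Pair = Dict[str, str]


def pairwise_accuracy(score: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(
        score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)


def length_biased_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model that only rewards longer responses."""
    return float(len(response))


# Style favors the correct answer: the chosen response is the verbose one.
style_helps = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "Adding two and two gives 4, since 2 + 2 = 4.",
        "rejected": "5",
    }
]

# Style works against the correct answer: the rejected response is the verbose one.
style_hurts = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "4",
        "rejected": "Adding two and two gives 5, since 2 + 2 = 5.",
    }
]

acc_helps = pairwise_accuracy(length_biased_score, style_helps)
acc_hurts = pairwise_accuracy(length_biased_score, style_hurts)
print(f"style helps: {acc_helps:.2f}, style hurts: {acc_hurts:.2f}")
```

A scorer that tracks style rather than content looks perfect when style happens to align with correctness and fails completely when it does not. Reporting accuracy separately for such conditions, rather than one pooled number, is what exposes the bias.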

Code & Models

Repositories

thu-keg/rm-bench
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN