Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment
Yan Liu, Xiaoyuan Yi, Xiaokang Chen, Jing Yao, Jingwei Yi, Daoguang Zan, Zheng Liu, Xing Xie, Tsung-Yi Ho

TL;DR
This paper highlights the critical importance of reward model quality in language model alignment, demonstrating that current reward models are unreliable and significantly impact alignment outcomes, urging more rigorous evaluation and development.
Contribution
The paper introduces CHH-RLHF, a cleaned version of the HH-RLHF preference dataset, uses it to benchmark the accuracy of reward models used in previous alignment works, and systematically studies how reward model quality affects alignment performance.
Findings
Reward models vary significantly in quality and reliability.
Better reward models serve as more accurate human preference proxies.
Reward model quality critically influences alignment success.
Abstract
The demand for regulating potentially risky behaviors of large language models (LLMs) has ignited research on alignment methods. Since LLM alignment heavily relies on reward models for optimization or evaluation, neglecting the quality of reward models may cause unreliable results or even misalignment. Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance and used off-the-shelf reward models arbitrarily without verification, rendering the reward model "an elephant in the room". To this end, this work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.…
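The abstract does not spell out how reward model accuracy is computed. A common convention for preference datasets such as HH-RLHF, assumed here, is pairwise accuracy: the fraction of (prompt, chosen, rejected) triples for which the reward model scores the chosen response strictly higher than the rejected one. The sketch below illustrates that convention only; `pairwise_accuracy`, `toy_length_reward`, and the two example pairs are hypothetical and are not taken from the paper or from CHH-RLHF.

```python
from typing import Callable, Iterable, Tuple

# Assumption: a reward model is any callable mapping (prompt, response) to a scalar.
RewardFn = Callable[[str, str], float]
PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def pairwise_accuracy(reward_fn: RewardFn, pairs: Iterable[PreferencePair]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        # Ties count as errors: the model failed to express the labeled preference.
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("no preference pairs given")
    return correct / total


def toy_length_reward(prompt: str, response: str) -> float:
    """Toy stand-in, not a real reward model: it simply prefers longer responses."""
    return float(len(response))


# Hypothetical pairs for illustration; the second one penalizes a length bias.
toy_pairs = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then Reset.",
     "No idea."),
    ("Is this safe to eat?",
     "No.",
     "Sure, probably, go ahead and try it anyway."),
]

accuracy = pairwise_accuracy(toy_length_reward, toy_pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

In practice `reward_fn` would wrap a trained reward model, and the pairs would come from a held-out preference set. Under this convention, a score near 0.5 on a clean test set means the model is close to chance as a preference proxy.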
Taxonomy
Topics: Accounting and Organizational Management
Methods: Softmax · Attention Is All You Need
