Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment

Yan Liu; Xiaoyuan Yi; Xiaokang Chen; Jing Yao; Jingwei Yi; Daoguang Zan; Zheng Liu; Xing Xie; Tsung-Yi Ho

arXiv:2409.19024 · cs.CL · October 1, 2024

TL;DR

This paper argues that reward model quality is critical to language model alignment. It shows that reward models in current use are unreliable and that their quality significantly affects alignment outcomes, and it calls for more rigorous evaluation and development of reward models.

Contribution

The paper introduces CHH-RLHF, a cleaned version of the HH-RLHF preference dataset, benchmarks the accuracy of reward models on it, and systematically studies how reward model quality affects alignment performance.

Findings

1. Reward models vary significantly in quality and reliability.
2. Better reward models serve as more accurate human preference proxies.
3. Reward model quality critically influences alignment success.

Abstract

The demand for regulating potentially risky behaviors of large language models (LLMs) has ignited research on alignment methods. Since LLM alignment heavily relies on reward models for optimization or evaluation, neglecting the quality of reward models may cause unreliable results or even misalignment. Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance and used off-the-shelf reward models arbitrarily without verification, rendering the reward model "an elephant in the room". To this end, this work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.…
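The abstract does not spell out how reward model accuracy is computed. One common reading, assumed here rather than taken from the paper, is pairwise accuracy on a preference dataset: the fraction of (prompt, chosen, rejected) triples for which the reward model scores the chosen response strictly higher than the rejected one. The sketch below illustrates that metric only; `pairwise_accuracy`, `length_reward`, and the toy pairs are hypothetical names and data, not the authors' code or CHH-RLHF samples.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: a reward model maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]
PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def pairwise_accuracy(reward_fn: RewardFn, pairs: Iterable[PreferencePair]) -> float:
    """Fraction of pairs where the chosen response gets a strictly higher reward."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("no preference pairs given")
    return correct / total


def length_reward(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


# Made-up preference pairs for illustration only.
toy_pairs = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then Reset.",
     "No."),
    ("Is this safe?",
     "Yes.",
     "It depends on the dose; please check the label."),
]

acc = pairwise_accuracy(length_reward, toy_pairs)
print(f"pairwise accuracy: {acc:.2f}")
```

Under this reading, a reward model that ranks pairs no better than chance scores about 0.5, which is why a low benchmark accuracy would make it a poor proxy for human preference in either optimization or evaluation.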

Taxonomy

Topics: Accounting and Organizational Management

Methods: Softmax · Attention Is All You Need