The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Yanjun Chen; Dawei Zhu; Yirong Sun; Xinghao Chen; Wei Zhang; Xiaoyu Shen

arXiv:2410.06554 · cs.CL · October 17, 2024

TL;DR

This paper shows that, in RLHF for language models, models trained with moderately accurate reward models can outperform those trained with highly accurate ones, challenging the assumption that better reward models always produce better language models.

Contribution

Based on experiments with the QA-FEEDBACK dataset, the paper uncovers a paradox: stronger reward models do not always lead to better language model performance.

Findings

01. Language models trained with moderately accurate reward models outperform those trained with highly accurate ones.

02. This paradox challenges the widely held belief that stronger reward models always lead to better language models.

03. The results suggest that reward model selection strategies need to be reconsidered.

Abstract

Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional…

Code & Models

Repositories

EIT-NLP/AccuracyParadox-RLHF
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Linear Layer · Weight Decay · AdamW · Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Softmax