It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

Taiming Lu; Lingfeng Shen; Xinyu Yang; Weiting Tan; Beidi Chen; Huaxiu Yao

arXiv:2406.07971·cs.CL·June 14, 2024

TL;DR

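This paper studies how reward models (RMs) and policy models (PMs) interact during RLHF fine-tuning. It finds that RM scores on PM responses disagree with human preferences 35% of the time, and it proposes SEAM, an automatic metric of this discrepancy, which it applies to training and augmentation for reported gains of 4.5% and 4%, respectively.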

Contribution

It introduces the concept of seamlessness between reward and policy models, identifies a mismatch problem, and proposes SEAM as an automatic metric to enhance RLHF training and augmentation.

Findings

01

SEAM improves RLHF performance by 4.5%.

02

SEAM-guided augmentation yields 4% better results.

03

RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences.

Abstract

Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and…
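
To make the "mismatch rate" in the abstract concrete, the following is a minimal sketch of how such a rate could be computed over pairwise comparisons. It is not the paper's evaluation code and it is not the SEAM metric: the function name `mismatch_rate`, the pairwise input layout, the tie handling, and the toy numbers are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def mismatch_rate(
    rm_scores: Sequence[Tuple[float, float]],
    human_prefers_first: Sequence[bool],
) -> float:
    """Fraction of response pairs where the RM ranking disagrees with the human label.

    Hypothetical helper: each entry of rm_scores holds the reward-model scores
    for two policy-model responses to the same prompt, and the matching entry
    of human_prefers_first says whether annotators preferred the first one.
    """
    if len(rm_scores) != len(human_prefers_first):
        raise ValueError("rm_scores and human_prefers_first must have equal length")
    if not rm_scores:
        raise ValueError("at least one comparison pair is required")

    # Assumption: a tie in RM scores is treated as the RM preferring the second response.
    mismatches = sum(
        (score_a > score_b) != prefers_a
        for (score_a, score_b), prefers_a in zip(rm_scores, human_prefers_first)
    )
    return mismatches / len(rm_scores)


# Toy, made-up numbers purely to exercise the function.
pairs = [(0.9, 0.2), (0.1, 0.7), (0.4, 0.6), (0.8, 0.3)]
labels = [True, True, False, False]
rate = mismatch_rate(pairs, labels)
print(f"mismatch rate: {rate:.2f}")
```

Under this reading, the 35% figure would correspond to roughly one in three comparisons where the RM ranks PM responses differently from human annotators.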

Code & Models

Repositories

taiminglu/seamless (PyTorch, official)

Taxonomy

Topics: Healthcare Policy and Management

Methods: Self-supervised Equivariant Attention Mechanism · ALIGN