It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao

TL;DR
This paper investigates the interaction between reward and policy models in RLHF, revealing a mismatch issue and proposing an automatic metric, SEAM, to improve training and augmentation, leading to notable performance gains.
Contribution
It introduces the concept of seamlessness between reward and policy models, identifies a mismatch problem, and proposes SEAM as an automatic metric to enhance RLHF training and augmentation.
Findings
SEAM improves RLHF performance by 4.5%.
SEAM-guided augmentation yields 4% better results.
RMs fail to score PM responses properly, giving a 35% mismatch rate with human preferences.
Abstract
Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and…
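The mismatch rate mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pairwise setup in which each sample has two PM responses, an RM score for each, and a human label saying which response is preferred. The function name `mismatch_rate` and its arguments are hypothetical, and this is neither the SEAM metric nor the paper's evaluation protocol, only one plausible way to count disagreements between RM rankings and human preferences.

```python
from typing import Sequence


def mismatch_rate(
    rm_scores_a: Sequence[float],
    rm_scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
) -> float:
    """Fraction of response pairs where the RM ranking disagrees with the human label.

    Assumption (not from the paper): a tie in RM scores is treated as
    "RM prefers response B", so it counts as a mismatch only when the
    human preferred response A.
    """
    n = len(rm_scores_a)
    if n == 0:
        raise ValueError("at least one response pair is required")
    if len(rm_scores_b) != n or len(human_prefers_a) != n:
        raise ValueError("all inputs must have the same length")

    mismatches = 0
    for score_a, score_b, prefers_a in zip(rm_scores_a, rm_scores_b, human_prefers_a):
        rm_prefers_a = score_a > score_b
        if rm_prefers_a != prefers_a:
            mismatches += 1
    return mismatches / n


# Toy data (made up for illustration): the RM disagrees with the
# human label on the second and third pairs.
rate = mismatch_rate(
    [0.9, 0.2, 0.5, 0.1],
    [0.1, 0.8, 0.4, 0.7],
    [True, True, False, False],
)
print(f"mismatch rate: {rate:.2f}")
```

Under this reading, a 35% mismatch rate would mean the RM ranks the human-dispreferred response higher in roughly one of every three pairs.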
Taxonomy
Topics: Healthcare Policy and Management
Methods: Self-supervised Equivariant Attention Mechanism · ALIGN
