The Trickle-down Impact of Reward (In-)consistency on RLHF
Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu

TL;DR
This paper investigates the inconsistency of reward models in RLHF, introduces a benchmarking method to measure it, and proposes techniques to improve RM consistency, ultimately enhancing the quality of RLHF-trained chatbots.
Contribution
The paper introduces Contrast Instructions, a benchmark for RM consistency, and proposes two techniques, ConvexDA and RewardFusion, that improve consistency without extra training cost.
Findings
Current RMs perform poorly on Contrast Instructions compared to humans.
Improving RM consistency leads to more useful RLHF chatbot responses.
Reward inconsistency has a significant trickle-down effect on RLHF outcomes.
Abstract
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs -- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments -- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM. Each example in Contrast Instructions features…
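The notion of consistency described in the abstract (an RM should change its reward when the prompt changes in meaning) can be illustrated with a minimal sketch. It is not the authors' benchmark code: the pairing scheme, the `consistency_rate` function, and the word-overlap `toy_reward` are all assumptions made for illustration. The sketch assumes each test item is two similar instructions, each with its own reference response, and counts how often the reward function scores a matched instruction-response combination above the mismatched one.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a reward function maps (instruction, response) to a scalar.
RewardFn = Callable[[str, str], float]
Example = Tuple[str, str]  # (instruction, reference response)


def consistency_rate(reward: RewardFn, pairs: Sequence[Tuple[Example, Example]]) -> float:
    """Fraction of comparisons in which the matched response outscores the swapped one.

    This is an illustrative metric, not the paper's exact definition.
    """
    hits = 0
    total = 0
    for (instr_a, resp_a), (instr_b, resp_b) in pairs:
        # For each instruction, its own response should get the higher reward.
        hits += int(reward(instr_a, resp_a) > reward(instr_a, resp_b))
        hits += int(reward(instr_b, resp_b) > reward(instr_b, resp_a))
        total += 2
    return hits / total if total else 0.0


def toy_reward(instruction: str, response: str) -> float:
    """Stand-in for a trained RM: counts words shared by instruction and response."""
    shared = set(instruction.lower().split()) & set(response.lower().split())
    return float(len(shared))


# Made-up example data: two lexically similar instructions with different answers.
pairs = [
    (
        ("How do I boil an egg", "Boil the egg in water for ten minutes"),
        ("How do I fry an egg", "Fry the egg in a pan with oil"),
    )
]

print(consistency_rate(toy_reward, pairs))
```

In practice `toy_reward` would be replaced by a call into a trained reward model; a reward function that ignores the instruction entirely would score 0.0 on this metric because no matched combination is ever ranked strictly higher.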
Taxonomy
Topics: Reinforcement Learning in Robotics
