The Trickle-down Impact of Reward (In-)consistency on RLHF

Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu

arXiv:2309.16155 · cs.CL · September 29, 2023 · 1 citation

TL;DR

This paper investigates the inconsistency of reward models in RLHF, introduces a benchmarking method to measure it, and proposes techniques to improve RM consistency, ultimately enhancing the quality of RLHF-trained chatbots.

Contribution

It introduces Contrast Instructions for benchmarking RM consistency and proposes ConvexDA and RewardFusion to improve it without extra training costs.

Findings

1. Current RMs perform poorly on Contrast Instructions compared to humans.
2. Improving RM consistency leads to more useful RLHF chatbot responses.
3. Reward inconsistency has a significant trickle-down effect on RLHF outcomes.

Abstract

Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs -- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments -- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM. Each example in Contrast Instructions features…
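
The abstract excerpt above is cut off before it describes how Contrast Instructions examples are built, so the following is only a minimal sketch of one way to probe the property the abstract names: whether an RM adapts its reward when the prompt changes. It assumes each test item is a pair of similar instructions, each with its own intended response, and counts how often the reward for a matched (instruction, response) pair exceeds the reward for the mismatched one. The names `consistency_rate` and `toy_overlap_reward` are hypothetical, the word-overlap scorer merely stands in for a real reward model, and none of this should be read as the paper's actual benchmark protocol.

```python
from typing import Callable, Sequence, Tuple

# A reward function maps (instruction, response) to a scalar score.
RewardFn = Callable[[str, str], float]
Example = Tuple[str, str]  # (instruction, intended response)


def consistency_rate(reward_fn: RewardFn,
                     pairs: Sequence[Tuple[Example, Example]]) -> float:
    """Fraction of comparisons where the matched response outscores the
    response that was written for the other, similar instruction."""
    correct = 0
    total = 0
    for (instr_a, resp_a), (instr_b, resp_b) in pairs:
        # For instruction A, its own response should be preferred.
        correct += int(reward_fn(instr_a, resp_a) > reward_fn(instr_a, resp_b))
        # Symmetric check for instruction B.
        correct += int(reward_fn(instr_b, resp_b) > reward_fn(instr_b, resp_a))
        total += 2
    return correct / total if total else 0.0


def toy_overlap_reward(instruction: str, response: str) -> float:
    """Hypothetical stand-in for an RM: counts shared lowercase words."""
    shared = set(instruction.lower().split()) & set(response.lower().split())
    return float(len(shared))


if __name__ == "__main__":
    pairs = [
        (("Name a fruit that is red", "An apple is a red fruit"),
         ("Name a fruit that is yellow", "A banana is a yellow fruit")),
    ]
    print(consistency_rate(toy_overlap_reward, pairs))
    # A reward that ignores the prompt entirely can never win a strict comparison.
    print(consistency_rate(lambda instr, resp: 0.0, pairs))
```

In practice `reward_fn` would wrap a trained reward model's forward pass; keeping it as a plain callable makes the check independent of any particular framework.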

Code & Models

Repositories

shadowkiller33/contrast-instruction (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics