Policy Optimization in RLHF: The Impact of Out-of-preference Data

Ziniu Li; Tian Xu; Yang Yu

arXiv:2312.10584 · cs.LG · February 27, 2024 · 1 citation

TL;DR

This paper investigates how out-of-preference data influences policy optimization in RLHF, showing that leveraging such data with RMB-PO+ enhances alignment performance by improving reward model generalization.

Contribution

The paper evaluates how out-of-preference data affects policy optimization methods, and highlights that RMB-PO+ makes the best use of this data for alignment.

Findings

01. RMB-PO+ outperforms DPO in experiments.
02. Out-of-preference data significantly improves policy performance.
03. Reward model generalization benefits from preference-free data.

Abstract

Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods, either explicitly or implicitly, learn a reward model from preference data and differ in the data used for policy optimization to unlock the generalization ability of the reward model. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data. We examine the impact of such out-of-preference data. Our study, conducted through controlled and synthetic experiments, demonstrates that DPO performs poorly, whereas RMB-PO+ performs the best. In particular, even when providing the policy model with a good feature…
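The difference in optimization data described in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes the standard DPO loss and a KL-regularized reward objective (the abstract does not state the regularizer), and the names `dpo_loss`, `rmb_po_objective`, `pref_prompts`, and `free_prompts` are hypothetical. The exact expectation over actions stands in for sampling policy-generated responses.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss, evaluated only on labeled preference pairs.

    logp_w / logp_l are policy log-probabilities of the preferred and
    dispreferred responses; ref_* are the reference policy's values.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed stably as log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))

def rmb_po_objective(policy, ref_policy, reward, prompts, beta=0.1):
    """KL-regularized expected reward, averaged over the given prompts.

    policy, ref_policy, reward: arrays of shape (num_prompts, num_actions).
    The policy's own action distribution is scored by the learned reward,
    which is what makes the data "policy-generated" (assumed objective).
    """
    total = 0.0
    for s in prompts:
        p, q = policy[s], ref_policy[s]
        total += np.sum(p * reward[s]) - beta * np.sum(p * np.log(p / q))
    return float(total / len(prompts))

# Hypothetical toy setup: 4 prompts, 2 actions each.
pi = np.full((4, 2), 0.5)
learned_reward = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
pref_prompts = [0, 1]   # prompts that appear in the preference dataset
free_prompts = [2, 3]   # new prompts with no preference labels

rmb_po = rmb_po_objective(pi, pi, learned_reward, pref_prompts)
rmb_po_plus = rmb_po_objective(pi, pi, learned_reward, pref_prompts + free_prompts)
```

Under these assumptions, the only change from RMB-PO to RMB-PO+ is the prompt set passed to the same objective, whereas DPO never queries the reward model outside the labeled pairs.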

Code & Models

Repositories

liziniu/policy_optimization
PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Recommender Systems and Techniques

Methods: Direct Preference Optimization