Policy Optimization in RLHF: The Impact of Out-of-preference Data
Ziniu Li, Tian Xu, Yang Yu

TL;DR
This paper investigates how out-of-preference data influences policy optimization in RLHF, showing that leveraging such data with RMB-PO+ enhances alignment performance by improving reward model generalization.
Contribution
It introduces and evaluates the impact of out-of-preference data in policy optimization methods, highlighting the superiority of RMB-PO+ in leveraging this data for better alignment.
Findings
RMB-PO+ outperforms DPO in the paper's controlled, synthetic experiments.
Out-of-preference data significantly improves policy performance.
Reward model generalization benefits from preference-free data.
Abstract
Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods learn a reward model from preference data, either explicitly or implicitly, and differ in the data they use for policy optimization to unlock the reward model's generalization ability. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data. We examine the impact of such out-of-preference data. Our study, conducted through controlled and synthetic experiments, demonstrates that DPO performs poorly, whereas RMB-PO+ performs the best. In particular, even when providing the policy model with a good feature…
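For readers unfamiliar with DPO, the sketch below shows the commonly cited per-pair DPO loss. It illustrates why DPO only touches responses that appear in the preference data: the loss is defined over a (chosen, rejected) pair, so there is no term for policy-generated or preference-free samples. This is a minimal illustration, not the authors' code; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made here, and the paper's exact formulation is not shown in this excerpt.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    All inputs are sequence log-probabilities (floats). `beta` is a
    hypothetical default chosen for illustration only.
    """
    # Implicit reward margin between the chosen and rejected responses,
    # measured relative to the reference policy.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

By contrast, a reward-model-based method such as RMB-PO scores responses with a separately learned reward model, so it can evaluate responses sampled from the policy and, in RMB-PO+, responses to prompts that have no preference labels.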
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Recommender Systems and Techniques
Methods: Direct Preference Optimization
