Dataset Reset Policy Optimization for RLHF
Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

TL;DR
This paper introduces Dataset Reset Policy Optimization (DR-PO), a new RLHF algorithm that leverages offline preference data through resets, providing theoretical guarantees and improved performance over existing methods in generative model fine-tuning.
Contribution
The paper proposes DR-PO, an RLHF algorithm that integrates offline preference datasets via resets, with provable guarantees and superior empirical results.
Findings
DR-PO outperforms PPO and DPO in GPT-4 win-rate metrics.
Theoretical guarantees show that DR-PO learns a policy that performs at least as well as any policy covered by the offline dataset.
Experimental results on summarization and helpfulness datasets demonstrate improved generation quality.
Abstract
Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state…
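The dataset-reset idea in the abstract can be illustrated with a small sketch. It assumes a token-level view in which a "state" is the prompt plus a partial response, and it assumes a mixing probability `reset_prob` that decides whether a rollout starts from a prefix of a labeler-preferred response or from the prompt alone. The function names, the uniform choice of cut point, and the `policy_step` callable are all hypothetical and are not taken from the authors' code.

```python
import random


def sample_start_state(prompt, preferred_response, reset_prob, rng):
    """Pick the token prefix from which a rollout begins.

    Assumption: with probability `reset_prob` we reset to a state from the
    offline dataset (prompt + a prefix of the preferred response); otherwise
    we start from the initial state (the prompt alone).
    """
    if rng.random() < reset_prob:
        # Dataset reset: keep the first `cut` tokens of the preferred response.
        cut = rng.randint(0, len(preferred_response))  # inclusive on both ends
        return list(prompt) + list(preferred_response[:cut])
    return list(prompt)


def rollout(start_state, policy_step, max_len):
    """Continue generation from `start_state` until `max_len` tokens."""
    tokens = list(start_state)
    while len(tokens) < max_len:
        # `policy_step` is a stand-in for sampling one token from the policy.
        tokens.append(policy_step(tokens))
    return tokens
```

In a full RLHF loop, the completed rollout would then be scored by the learned reward model and used for a policy-gradient update; that part is omitted here.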
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Dropout · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Label Smoothing
