Dataset Reset Policy Optimization for RLHF

Jonathan D. Chang; Wenhao Zhan; Owen Oertell; Kianté Brantley; Dipendra Misra; Jason D. Lee; Wen Sun

arXiv:2404.08495·cs.LG·April 17, 2024·1 cites

TL;DR

This paper introduces Dataset Reset Policy Optimization (DR-PO), a new RLHF algorithm that leverages offline preference data through resets, providing theoretical guarantees and improved performance over existing methods in generative model fine-tuning.

Contribution

The paper proposes DR-PO, an RLHF algorithm that integrates offline preference datasets via resets, with provable guarantees and superior empirical results.

Findings

01. DR-PO outperforms PPO and DPO on GPT-4 win rate.

02. The theoretical guarantees show that DR-PO performs at least as well as any policy covered by the offline dataset.

03. Experiments on summarization and helpfulness datasets show improved generation quality.

Abstract

Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state…
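
The dataset-reset idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `reset_prob` mixing parameter, and the choice of a uniformly random prefix length are all assumptions made for illustration. It treats a "state" as the prompt plus a partial response, so resetting to a dataset state means starting generation from a prefix of a labeler-preferred response rather than from the bare prompt.

```python
import random


def sample_reset_state(prompt_tokens, preferred_tokens, rng, reset_prob=1.0):
    """Pick the token prefix from which the policy starts generating.

    Hypothetical helper: with probability `reset_prob` the rollout starts
    from the prompt plus a random-length prefix of the labeler-preferred
    response (a dataset reset); otherwise it starts from the prompt alone
    (the usual initial state).
    """
    if rng.random() >= reset_prob:
        return list(prompt_tokens)
    # randint is inclusive on both ends, so cut ranges over 0..len(response).
    cut = rng.randint(0, len(preferred_tokens))
    return list(prompt_tokens) + list(preferred_tokens[:cut])


def rollout(policy_step, start_state, max_new_tokens, eos_token):
    """Continue generation from `start_state` using `policy_step`.

    `policy_step` is an assumed callable mapping the current token list
    to the next token; it stands in for a real language-model policy.
    """
    state = list(start_state)
    for _ in range(max_new_tokens):
        token = policy_step(state)
        state.append(token)
        if token == eos_token:
            break
    return state
```

In an RLHF loop, the completed rollout would then be scored by the learned reward model and used for the policy update; setting `reset_prob=0.0` in this sketch recovers the standard setup in which every rollout starts from the prompt.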

Code & Models

Repositories

cornell-rl/drpo (PyTorch, official)

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Dropout · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Label Smoothing