Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Tengyang Xie; Dylan J. Foster; Akshay Krishnamurthy; Corby Rosset; Ahmed Awadallah; Alexander Rakhlin

arXiv:2405.21046·cs.LG·June 3, 2024·1 cites

TL;DR

This paper introduces XPO, a new algorithm for reinforcement learning from human feedback that enhances exploration, is provably sample-efficient, and improves empirical performance in language model training.

Contribution

XPO is a simple, practical exploration algorithm that extends DPO with a novel bonus, offering strong theoretical guarantees and empirical improvements.
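
To make "extends DPO with a novel bonus" concrete, here is a minimal per-example sketch in plain Python. The DPO term is the standard pairwise logistic loss on log-probability ratios. The bonus term is an assumption made for illustration: it is written as a weight `alpha` times the policy's log-probability of a response sampled from the reference model, since this summary does not spell out the bonus. The names `dpo_loss`, `xpo_style_loss`, `logp_ref_sample`, `alpha`, and `beta` are all hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def xpo_style_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   logp_ref_sample: float,
                   alpha: float = 0.01, beta: float = 0.1) -> float:
    """DPO loss plus an exploration bonus (assumed form, not confirmed here).

    logp_ref_sample: policy log-prob of a response drawn from the reference
    model. Minimizing alpha * logp_ref_sample moves probability mass away
    from responses the reference model already produces.
    """
    bonus = alpha * logp_ref_sample  # the single added term
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) + bonus
```

With `alpha = 0` the sketch reduces to plain DPO, which matches the description of XPO as a small modification of the DPO objective. In a real training loop these scalars would be batched sequence log-probabilities from a language model.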

Findings

1. XPO outperforms non-exploratory DPO in sample efficiency.
2. XPO is provably near-optimal under natural exploration conditions.
3. The theoretical analysis combines techniques from language modeling and reinforcement learning.

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical -- a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys…

Taxonomy

Topics: Advanced Data Compression Techniques · Data Management and Algorithms · Face and Expression Recognition

Methods: Direct Preference Optimization