Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin

TL;DR
This paper introduces XPO, a new algorithm for reinforcement learning from human feedback that enhances exploration, is provably sample-efficient, and improves empirical performance in language model training.
Contribution
XPO is a simple, practical exploration algorithm that extends DPO with a novel bonus, offering strong theoretical guarantees and empirical improvements.
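To make the idea of "DPO plus a bonus" concrete, here is a minimal Python sketch for a single preference pair. The DPO part is the standard log-sigmoid loss on the policy-versus-reference log-ratio margin. The exact form of the bonus is not stated on this page, so the sketch assumes it is a term alpha * log pi(y_explore | x) evaluated on an extra response sampled from the reference policy. The function names, the parameters alpha and beta, and their default values are illustrative and are not taken from the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)


def xpo_style_loss(logp_chosen: float, logp_rejected: float,
                   ref_logp_chosen: float, ref_logp_rejected: float,
                   logp_explore: float,
                   alpha: float = 1e-3, beta: float = 0.1) -> float:
    """DPO loss plus an exploration term (assumed form, see lead-in).

    logp_explore is the policy's log-probability of a response drawn from
    the reference policy. Minimizing alpha * logp_explore pushes probability
    mass away from responses the reference already produces.
    """
    base = dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta)
    return base + alpha * logp_explore
```

With alpha set to 0 the sketch reduces to plain DPO, which is the sense in which the bonus is a small addition to the existing objective. In practice the log-probabilities would be summed token log-probs from the policy and reference models, batched and averaged.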
Findings
XPO outperforms non-exploratory DPO in sample efficiency
XPO is provably near-optimal under natural exploration conditions
Theoretical analysis combines language modeling and reinforcement learning techniques
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical -- a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys…
Code & Models
- 🤗 qgallouedec/xpo-qwen2 (model) · 2 dl
- 🤗 RichardErkhov/qgallouedec_-_xpo-qwen2-gguf (model) · 338 dl
- 🤗 trl-lib/Qwen2-0.5B-XPO (model) · 5 dl
- 🤗 MYC081/Qwen2.5-3B-WPO-bf16-1 (model) · 5 dl
- 🤗 RichardErkhov/qgallouedec_-_xpo-qwen2-awq (model) · 1 dl
- 🤗 RichardErkhov/trl-lib_-_Qwen2-0.5B-XPO-exl2 (model)
- 🤗 ntlfi/qwen2-0.5b-it_XPO_iter3 (model)
- 🤗 ntlfi/qwen2-0.5b-it_XPO_iter1_100 (model)
- 🤗 ntlfi/qwen2-0.5b-it_XPO_iter2_100 (model)
- 🤗 ntlfi/qwen2-0.5b-it_XPO_iter3_100 (model) · 8 dl
Taxonomy
Topics: Advanced Data Compression Techniques · Data Management and Algorithms · Face and Expression Recognition
Methods: Direct Preference Optimization
