Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama

TL;DR
This paper introduces DEPO, a data-dependent exploration method for online RLHF that uses historical data to improve sample efficiency in training language models with human feedback.
Contribution
The paper proposes a scalable, data-dependent exploration strategy for RLHF that leverages historical data to guide exploration and provides theoretical regret bounds.
Findings
DEPO outperforms strong baselines across benchmarks.
It improves sample efficiency in online RLHF.
Theoretical regret bounds adapt to task difficulty.
Abstract
Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration, which requires algorithms that enable LLMs to generate informative comparisons that improve the sample efficiency of online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for…
