Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang, Bingcong Li, Christoph Dann, Niao He

TL;DR
This paper explores how imperfect reward models can be leveraged to improve sample efficiency in online RLHF, introducing a new transfer learning algorithm with theoretical guarantees and practical transfer strategies.
Contribution
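The paper introduces a transfer learning principle and the Transfer Policy Optimization (TPO) algorithm. Both build on the observation that, under the KL-regularized RLHF objective, a policy's coverability of the optimal policy is captured by its sub-optimality.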
Findings
In theory, TPO offers provable benefits over standard online learning.
Transfer strategies improve policy optimization performance.
Empirical results validate theoretical advantages.
Abstract
Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy…
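The coverability property mentioned in the abstract can be illustrated numerically. The sketch below is not the paper's TPO algorithm or its formal statement. It assumes a single-prompt setting with four responses, a uniform reference policy, made-up reward values, and one common coverage measure, E_{y ~ pi*}[pi*(y) / pi(y)], which may differ from the definition the paper uses. All function and variable names are hypothetical. It compares policies induced by two imperfect reward models and shows that the one with the smaller sub-optimality gap also covers the optimal policy better.

```python
import math

def softmax_tilt(pi_ref, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(pi, pi_ref, rewards, beta):
    """KL-regularized objective: expected reward minus beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl(pi, pi_ref)

def coverage(pi_star, pi):
    """Assumed coverage measure: E_{y ~ pi_star}[pi_star(y) / pi(y)]; always >= 1."""
    return sum(ps * ps / p for ps, p in zip(pi_star, pi))

# Toy setup (illustrative values, not taken from the paper).
beta = 1.0
pi_ref = [0.25, 0.25, 0.25, 0.25]
true_r = [1.0, 0.0, 0.5, -0.5]

pi_star = softmax_tilt(pi_ref, true_r, beta)
j_star = objective(pi_star, pi_ref, true_r, beta)

# Two imperfect reward models: one close to true_r, one far from it.
imperfect = {
    "small error": [0.8, 0.1, 0.5, -0.4],
    "large error": [-0.5, 0.5, 0.0, 1.0],
}

results = {}
for name, r_hat in imperfect.items():
    pi_hat = softmax_tilt(pi_ref, r_hat, beta)  # policy optimal for the imperfect model
    gap = j_star - objective(pi_hat, pi_ref, true_r, beta)  # sub-optimality under true_r
    results[name] = (gap, coverage(pi_star, pi_hat))
    print(f"{name}: sub-optimality = {gap:.4f}, coverage = {results[name][1]:.4f}")
```

In this toy setting the sub-optimality gap equals beta * KL(pi_hat || pi*), a standard identity for KL-regularized objectives, so a small gap forces pi_hat to stay close to pi* and keeps the density ratio pi*/pi_hat under control. This gives some intuition for why a policy derived from a merely imperfect reward model can still be a useful source of samples.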
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Recommender Systems and Techniques
Methods: Direct Preference Optimization
