Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang; Bingcong Li; Christoph Dann; Niao He

arXiv:2502.19255 · cs.LG · May 20, 2025

TL;DR

This paper explores how imperfect reward models can be leveraged to improve sample efficiency in online RLHF, introducing a new transfer learning algorithm with theoretical guarantees and practical transfer strategies.

Contribution

It introduces a novel transfer learning principle and the TPO algorithm, leveraging policy coverability insights for improved efficiency in RLHF.

Findings

01. TPO has provable benefits over standard online learning.

02. Transfer strategies improve policy optimization performance.

03. Empirical results validate the theoretical advantages.

Abstract

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy…
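To give a feel for why KL-regularization ties sub-optimality to closeness to the optimal policy, here is a minimal sketch on a toy three-action bandit. It is not the paper's algorithm or notation: the objective J(pi) = E_pi[r] - beta * KL(pi || pi_ref), the closed-form optimum pi*(a) proportional to pi_ref(a) * exp(r(a) / beta), and the max-density-ratio `coverage_ratio` are standard textbook forms assumed here for illustration, and all names and numbers are hypothetical. Under these assumptions the sub-optimality gap equals beta * KL(pi || pi*), which the script checks numerically.

```python
import math

def optimal_policy(ref, rewards, beta):
    """Closed-form maximizer of the KL-regularized objective (assumed standard form)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, rewards, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref)

def coverage_ratio(target, policy):
    """Max density ratio target/policy; an illustrative coverage measure only."""
    return max(t / p for t, p in zip(target, policy))

# Hypothetical toy problem: three actions, made-up rewards and reference policy.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5

pi_star = optimal_policy(ref, rewards, beta)
pi = [0.4, 0.2, 0.4]  # some candidate policy to evaluate

gap = objective(pi_star, ref, rewards, beta) - objective(pi, ref, rewards, beta)
print("sub-optimality gap:  ", gap)
print("beta * KL(pi || pi*):", beta * kl(pi, pi_star))
print("coverage of pi* by pi:", coverage_ratio(pi_star, pi))
```

The first two printed values agree up to floating-point error, so in this toy setting a policy with a small gap is necessarily close to pi*. How the paper formalizes coverability and bounds it by sub-optimality may differ from this identity.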

Code & Models

Repositories

jiaweihhuang/rlhf_rewardtransfer (PyTorch, official)

Videos

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Recommender Systems and Techniques

Methods: Direct Preference Optimization