Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

TL;DR
This paper introduces VPO, a unified method for online and offline RLHF that incorporates uncertainty estimation into reward modeling, improving alignment of language models with human preferences.
Contribution
The paper proposes VPO, a novel approach that regularizes reward estimates with value functions, providing a unified, theoretically-grounded framework for RLHF in large language models.
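To make "regularizes reward estimates with value functions" concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a single prompt with a finite set of responses, a Bradley-Terry preference likelihood, and a KL-regularized value whose maximum over policies has a log-sum-exp closed form. The function names, the weight `alpha`, the temperature `beta`, and the `sign` convention (+1 for optimism in the online setting, -1 for pessimism in the offline setting) are all illustrative assumptions rather than details taken from this summary.

```python
import math

def bt_nll(r, prefs):
    """Bradley-Terry negative log-likelihood.

    r: list of scalar rewards, one per candidate response.
    prefs: list of (winner_index, loser_index) pairs.
    """
    return -sum(math.log(1.0 / (1.0 + math.exp(-(r[w] - r[l])))) for w, l in prefs)

def kl_regularized_value(r, pi_ref, beta):
    """max over policies pi of E_pi[r] - beta * KL(pi || pi_ref).

    Closed form: beta * log sum_y pi_ref(y) * exp(r(y) / beta),
    computed with the usual max-shift for numerical stability.
    """
    m = max(ri / beta for ri in r)
    return beta * (m + math.log(sum(p * math.exp(ri / beta - m) for ri, p in zip(r, pi_ref))))

def value_regularized_loss(r, prefs, pi_ref, beta, alpha, sign):
    """Reward-fitting loss with a value-function regularizer (assumed form).

    sign=+1 rewards high value (optimism); sign=-1 penalizes it (pessimism).
    Note: the Bradley-Terry term is invariant to shifting all rewards by a
    constant while the value term is not, so a real objective would need a
    reward normalization; this sketch only evaluates the loss.
    """
    return bt_nll(r, prefs) - sign * alpha * kl_regularized_value(r, pi_ref, beta)

# Toy usage: two responses, response 0 preferred over response 1.
rewards = [1.0, -1.0]
reference = [0.5, 0.5]
optimistic = value_regularized_loss(rewards, [(0, 1)], reference, 1.0, 0.1, +1)
pessimistic = value_regularized_loss(rewards, [(0, 1)], reference, 1.0, 0.1, -1)
```

In this toy setup the optimal value is positive, so the optimistic loss is lower than the pessimistic one for the same rewards; the sign flip is the only difference between the two settings.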
Findings
VPO achieves competitive performance in text summarization and dialog tasks.
Theoretical guarantees match standard RL rates for both online and offline settings.
VPO simplifies the RLHF pipeline by integrating reward modeling and policy optimization.
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations. In this paper, we introduce a unified approach to online and offline RLHF --…
Taxonomy
Topics: Scheduling and Optimization Algorithms · Advanced Manufacturing and Logistics Optimization · Vehicle Routing Optimization Methods
