Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Shicong Cen; Jincheng Mei; Katayoon Goshvadi; Hanjun Dai; Tong Yang; Sherry Yang; Dale Schuurmans; Yuejie Chi; Bo Dai

arXiv:2405.19320·cs.LG·February 20, 2025

Open Access

TL;DR

This paper introduces VPO, a unified method for online and offline RLHF that incorporates uncertainty estimation into reward modeling, improving alignment of language models with human preferences.

Contribution

The paper proposes VPO, which regularizes the maximum-likelihood reward estimate with the corresponding value function, giving a single, theoretically grounded framework for both online and offline RLHF in large language models.
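To make the idea of regularizing a reward estimate with a value function concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a single prompt with a small finite set of candidate responses, a Bradley-Terry preference likelihood, and a KL-regularized value with the well-known closed form beta * log sum_y pi_ref(y) * exp(r(y) / beta). All function names, the weights `alpha` and `beta`, and the `sign` convention (+1 for an optimistic, online-style bias; -1 for a pessimistic, offline-style bias) are assumptions made for illustration.

```python
import math

def center(rewards, ref_probs):
    """Shift rewards to have zero mean under pi_ref (assumed normalization).

    The preference likelihood ignores constant shifts of the reward, but the
    value term does not, so this toy pins the shift down before comparing.
    """
    mean = sum(p * r for r, p in zip(rewards, ref_probs))
    return [r - mean for r in rewards]

def bt_nll(rewards, prefs):
    """Average Bradley-Terry negative log-likelihood.

    prefs is a list of (winner_index, loser_index) pairs.
    """
    total = 0.0
    for winner, loser in prefs:
        diff = rewards[winner] - rewards[loser]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(prefs)

def optimal_value(rewards, ref_probs, beta):
    """Closed form of max_pi E_pi[r] - beta * KL(pi || pi_ref)."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    s = sum(p * math.exp(z - m) for z, p in zip(scaled, ref_probs))
    return beta * (m + math.log(s))

def regularized_objective(rewards, prefs, ref_probs, alpha, beta, sign):
    """Objective to minimize over the reward: NLL minus sign * alpha * value."""
    r = center(rewards, ref_probs)
    return bt_nll(r, prefs) - sign * alpha * optimal_value(r, ref_probs, beta)

# Two responses, response 0 preferred once, uniform reference policy.
rewards, ref_probs, prefs = [1.0, -1.0], [0.5, 0.5], [(0, 1)]
optimistic = regularized_objective(rewards, prefs, ref_probs, 0.1, 1.0, +1)
pessimistic = regularized_objective(rewards, prefs, ref_probs, 0.1, 1.0, -1)
```

In this toy setting the same reward candidate scores lower (better) under the optimistic sign than under the pessimistic one whenever its optimal value is positive, which is the sense in which the value term biases the reward estimate in one direction or the other.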

Findings

01

VPO achieves competitive performance in text summarization and dialog tasks.

02

Theoretical guarantees match standard RL rates for both online and offline settings.

03

VPO simplifies the RLHF pipeline by integrating reward modeling and policy optimization.

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations. In this paper, we introduce a unified approach to online and offline RLHF --…

Taxonomy

Topics: Scheduling and Optimization Algorithms · Advanced Manufacturing and Logistics Optimization · Vehicle Routing Optimization Methods