Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration

Avinandan Bose; Zhihan Xiong; Aadirupa Saha; Simon Shaolei Du; Maryam Fazel

arXiv:2412.10616 · cs.LG · December 17, 2024

TL;DR

This paper introduces Hybrid Preference Optimization (HPO), a method that combines offline human preference data with online exploration to improve the efficiency and convergence of reinforcement learning from human feedback, with proven theoretical guarantees.

Contribution

The paper presents the first provably optimal hybrid RLHF algorithm, supported by matching lower bounds. It relaxes the concentrability requirements placed on offline data and improves the sample efficiency of online exploration.

Findings

01  Improved sample efficiency over pure offline and online methods

02  Provably optimal theoretical bounds for hybrid RLHF

03  Relaxed offline data requirements for better exploration

Abstract

Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However, offline algorithms impose strict concentrability requirements, which are often difficult to satisfy. On the other hand, while online algorithms can avoid the concentrability issue, pure online exploration could be expensive due to the active preference query cost and real-time implementation overhead. In this paper, we propose a novel approach: Hybrid Preference Optimization (HPO) which combines online exploration with existing offline preferences by relaxing the stringent concentrability conditions for offline exploration, as well as significantly improving the sample efficiency for its online counterpart. We give the first provably optimal theoretical…

Taxonomy

Topics: Advanced Manufacturing and Logistics Optimization · Vehicle Routing Optimization Methods · Constraint Satisfaction and Optimization