Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration
Avinandan Bose, Zhihan Xiong, Aadirupa Saha, Simon Shaolei Du, Maryam Fazel

TL;DR
This paper introduces Hybrid Preference Optimization (HPO), a method that combines offline human preference data with online exploration to improve the efficiency and convergence of reinforcement learning from human feedback, with proven theoretical guarantees.
Contribution
The paper presents the first provably optimal hybrid RLHF algorithm, with matching lower bounds, that relaxes offline data requirements and improves online exploration efficiency.
Findings
Improved sample efficiency over pure offline and online methods
Provably optimal theoretical bounds for hybrid RLHF
Relaxed concentrability requirements on the offline preference data
Abstract
Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However, offline algorithms impose strict concentrability requirements, which are often difficult to satisfy. On the other hand, while online algorithms can avoid the concentrability issue, pure online exploration could be expensive due to the active preference query cost and real-time implementation overhead. In this paper, we propose a novel approach, Hybrid Preference Optimization (HPO), which combines online exploration with existing offline preferences by relaxing the stringent concentrability conditions for offline exploration, as well as significantly improving the sample efficiency for its online counterpart. We give the first provably optimal theoretical…
Taxonomy
Topics: Advanced Manufacturing and Logistics Optimization · Vehicle Routing Optimization Methods · Constraint Satisfaction and Optimization
