WPO: Enhancing RLHF with Weighted Preference Optimization
Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu

TL;DR
WPO introduces a reweighting strategy for off-policy preference data in RLHF, improving alignment of language models with human values by simulating on-policy learning without extra costs.
Contribution
The paper proposes Weighted Preference Optimization (WPO), a novel method that mitigates distributional gaps in off-policy RLHF by reweighting preference data to resemble on-policy data.
Findings
WPO outperforms DPO by up to 5.6% on Alpaca Eval 2.
WPO achieves a 76.7% length-controlled winning rate against GPT-4-turbo.
WPO enhances RLHF without additional costs.
Abstract
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also…
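The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the batch layout, and in particular the weight formula (the product of the length-normalized sequence probabilities of both responses under the current policy) are assumptions made for illustration, and the paper's exact weighting may differ. The sketch takes precomputed summed log-probabilities and scales a standard DPO pair loss by a weight that grows as the pair becomes more likely under the current policy.

```python
import math


def dpo_pair_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are summed sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pair_weight(pol_w, pol_l, len_w, len_l):
    """Hypothetical weight: length-normalized probability of the pair
    under the current policy (an assumption, not the paper's formula)."""
    return math.exp(pol_w / len_w + pol_l / len_l)


def weighted_preference_loss(batch, beta=0.1):
    """Mean of the weighted pair losses over a batch of dicts."""
    total = 0.0
    for ex in batch:
        # In a real training loop the weight would be treated as a
        # constant, with no gradient flowing through it.
        w = pair_weight(ex["pol_w"], ex["pol_l"], ex["len_w"], ex["len_l"])
        total += w * dpo_pair_loss(
            ex["pol_w"], ex["pol_l"], ex["ref_w"], ex["ref_l"], beta
        )
    return total / len(batch)


# Two pairs with identical DPO margins; the second is far less likely
# under the current policy, so it contributes less to the loss.
batch = [
    {"pol_w": -10.0, "pol_l": -12.0, "ref_w": -11.0, "ref_l": -11.0,
     "len_w": 10, "len_l": 10},
    {"pol_w": -40.0, "pol_l": -42.0, "ref_w": -41.0, "ref_l": -41.0,
     "len_w": 10, "len_l": 10},
]
loss = weighted_preference_loss(batch)
```

Because the weight depends only on the current policy's probabilities, pairs that the policy could plausibly have generated itself dominate the update, which is the sense in which off-policy data is made to resemble on-policy data.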
Code & Models
- 🤗 wzhouad/Llama3-Instruct-8B-WPO-FP (model, 1 dl)
- 🤗 wzhouad/Llama3-Instruct-8B-WPO-HB (model, 2 dl, ♡ 1)
- 🤗 wzhouad/zephyr-7B-WPO-FP (model, 3 dl)
- 🤗 wzhouad/zephyr-7B-WPO-HB (model, 2 dl)
- 🤗 wzhouad/Llama3-Instruct-8B-WPO-HB-v2 (model, 4 dl, ♡ 5)
- 🤗 wzhouad/gemma-2-9b-it-WPO-HB (model, 14 dl, ♡ 34)
- 🤗 wzhouad/gemma-2-9b-it-WPO-FP (model, 3 dl)
- 🤗 RichardErkhov/wzhouad_-_gemma-2-9b-it-WPO-HB-gguf (model, 140 dl)
- 🤗 QuantFactory/gemma-2-9b-it-WPO-HB-GGUF (model, 269 dl, ♡ 2)
- 🤗 allura-org/Luna-27B-v0 (model, 14 dl, ♡ 12)
Taxonomy
Topics: Constraint Satisfaction and Optimization
Methods: ALIGN
