AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

TL;DR
AlphaDPO introduces an adaptive reward margin mechanism for preference optimization, improving alignment and diversity in large language models by balancing policy and reference models dynamically.
Contribution
It proposes $\alpha$-DPO, a novel adaptive preference optimization algorithm with theoretical guarantees, outperforming existing methods like DPO and SimPO in LLM fine-tuning.
Findings
Consistently outperforms DPO and SimPO in empirical evaluations.
Achieves higher win rates in alignment tasks.
Provides theoretical guarantees for adaptive reward margin effectiveness.
Abstract
Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive…
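To make the contrast in the abstract concrete, here is a minimal sketch of the two baseline objectives for a single preference pair, written from the standard published forms of the DPO and SimPO losses. The function names, argument names, and default values for `beta` and `gamma` are illustrative assumptions, not taken from this page. The adaptive margin of $\alpha$-DPO is not reproduced, since the abstract is truncated before it is defined.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    The implicit reward is beta * (policy log-prob - reference log-prob),
    so the loss depends on the reference model. beta=0.1 is an assumed
    illustrative value.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair.

    The reward is the length-normalized policy log-prob (no reference
    model), and gamma is a fixed target reward margin. beta and gamma
    defaults are assumed illustrative values.
    """
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)


if __name__ == "__main__":
    # Sequence log-probabilities and lengths below are made-up inputs.
    print(dpo_loss(-10.0, -20.0, -12.0, -18.0))
    print(simpo_loss(-10.0, -20.0, 10, 10))
```

In this sketch the DPO loss changes whenever the reference log-probabilities change, while the SimPO loss applies the same `gamma` to every pair. Those are the two properties the abstract identifies as limitations.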
Taxonomy
Topics: Constraint Satisfaction and Optimization · Data Management and Algorithms
Methods: Direct Preference Optimization
