AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Junkang Wu; Xue Wang; Zhengyi Yang; Jiancan Wu; Jinyang Gao; Bolin Ding; Xiang Wang; Xiangnan He

arXiv:2410.10148 · cs.LG · July 22, 2025

TL;DR

AlphaDPO introduces an adaptive reward margin mechanism for preference optimization, improving alignment and diversity in large language models by balancing policy and reference models dynamically.

Contribution

It proposes $\alpha$-DPO, a novel adaptive preference optimization algorithm with theoretical guarantees, outperforming existing methods like DPO and SimPO in LLM fine-tuning.

Findings

1. Consistently outperforms DPO and SimPO in empirical evaluations.

2. Achieves higher win rates in alignment tasks.

3. Provides theoretical guarantees for adaptive reward margin effectiveness.

Abstract

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive…
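
For context on the two baselines the abstract names, the following is a minimal sketch of the standard DPO and SimPO pairwise losses as they are commonly written, not of the adaptive margin in $\alpha$-DPO, which the truncated abstract does not specify. The function names, argument names, and default values of `beta` and `gamma` are illustrative assumptions; the inputs are summed sequence log-probabilities of the preferred (`w`) and dispreferred (`l`) responses.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w: float, policy_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """DPO: the implicit reward is the policy/reference log-ratio.

    The loss depends on the reference model through ref_w and ref_l.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)


def simpo_loss(policy_w: float, policy_l: float,
               len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalized log-probability as reward, no reference model.

    gamma is the fixed target reward margin (illustrative default).
    """
    margin = beta * (policy_w / len_w - policy_l / len_l) - gamma
    return -log_sigmoid(margin)


# Hypothetical log-probabilities for one preference pair.
print(dpo_loss(policy_w=-9.0, policy_l=-12.0, ref_w=-10.0, ref_l=-12.0))
print(simpo_loss(policy_w=-20.0, policy_l=-30.0, len_w=10, len_l=10))
```

In this sketch the margin `gamma` is a constant shared by every preference pair, which is the fixed-margin assumption the abstract identifies as a limitation of SimPO, while `dpo_loss` cannot be computed without the reference log-probabilities.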

Code & Models

Repositories

junkangwu/alpha-dpo (PyTorch, official)

Taxonomy

Topics: Constraint Satisfaction and Optimization · Data Management and Algorithms

Methods: Direct Preference Optimization