$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
Zhengyuan Fan, Zhonghua Wu, Yuxuan Du, Qun Chen

TL;DR
The paper introduces $\xi$-DPO, a novel preference optimization method that reformulates the objective to improve interpretability and eliminate hyperparameter tuning challenges in reference-free preference learning.
Contribution
It proposes a new ratio reward margin formulation that simplifies preference optimization and removes the need for tuning hyperparameters like $\beta$ and $\gamma$.
Findings
$\xi$-DPO effectively cancels the effect of $\beta$ in preference optimization.
The ratio reward margin $\xi$ is interpretable and can be set based on initial reward gap distribution.
Experimental results demonstrate improved stability and performance over existing methods.
Abstract
Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that one of these hyperparameters implicitly controls sample filtering, while the effect of the other depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the…
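For readers unfamiliar with the baseline, the following is a minimal sketch of the SimPO per-pair loss as it is commonly stated (length-normalized log-likelihoods scaled by $\beta$, with a target margin $\gamma$ subtracted inside the sigmoid). It is not the $\xi$-DPO objective, which the truncated abstract does not spell out. The function names and the default values of `beta` and `gamma` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(logp_chosen: float, logp_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for a single preference pair (sketch).

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths in tokens.
    beta, gamma: illustrative defaults, not values taken from the paper.
    """
    # Length-normalized implicit rewards, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # Reward gap minus the target margin gamma.
    z = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-z)


if __name__ == "__main__":
    # Average log-probs of -1.0 and -1.5 give a scaled gap of 1.0,
    # which exactly matches gamma here, so z is 0.
    print(simpo_loss(-10.0, -15.0, 10, 10))
```

Because $\gamma$ is subtracted from a gap that has already been multiplied by $\beta$, the two values cannot be read independently of each other, which is one way to see the joint tuning difficulty the abstract describes.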
