$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$
Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

TL;DR
This paper introduces a dynamic $\beta$ calibration method for Direct Preference Optimization, improving LLM alignment with human preferences by adapting to data quality and filtering out outliers.
Contribution
It proposes a novel framework that dynamically adjusts $\beta$ during training and incorporates data filtering, enhancing DPO's robustness and performance.
Findings
Dynamic $\beta$ improves model alignment with preferences.
Data filtering reduces the impact of outliers.
Significant performance gains across models and datasets.
Abstract
Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models…
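To make the role of $\beta$ concrete, the following is a minimal Python sketch, not the authors' implementation. `dpo_loss` is the standard per-pair DPO objective, $-\log \sigma(\beta \cdot \text{margin})$, where the margin is the difference of policy-vs-reference log-ratios for the chosen and rejected responses. The abstract does not state the calibration or filtering rules, so `dynamic_beta` (a linear rescaling of a base value by the batch's mean margin) and `keep_indices` (a standard-deviation cutoff) are hypothetical stand-ins, as are the parameter names `beta0`, `alpha`, `baseline`, `min_beta`, and `k`.

```python
import math
import statistics


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dynamic_beta(margins, beta0=0.1, alpha=0.5, baseline=0.0, min_beta=1e-3):
    """Hypothetical batch-level rule (an assumption, not taken from the paper).

    Scales the base value beta0 linearly with how far the batch's mean margin
    sits from a baseline, and clamps the result so beta stays positive.
    """
    mean_margin = sum(margins) / len(margins)
    return max(min_beta, beta0 * (1.0 + alpha * (mean_margin - baseline)))


def keep_indices(margins, k=1.5):
    """Hypothetical outlier filter (an assumption, not taken from the paper).

    Keeps the pairs whose margin lies within k population standard
    deviations of the batch mean.
    """
    mean_margin = sum(margins) / len(margins)
    std = statistics.pstdev(margins)
    if std == 0.0:
        return list(range(len(margins)))
    return [i for i, m in enumerate(margins) if abs(m - mean_margin) <= k * std]


if __name__ == "__main__":
    batch_margins = [0.10, 0.20, 0.15, 0.12, 5.0]  # illustrative values only
    kept = keep_indices(batch_margins)
    beta = dynamic_beta([batch_margins[i] for i in kept])
    print("kept pairs:", kept)
    print("batch beta:", beta)
```

In this sketch the filter runs before the calibration, so one extreme pair cannot drag the batch-level $\beta$ with it; that ordering is a design choice of the example rather than something the abstract specifies.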
Taxonomy
Topics: Constraint Satisfaction and Optimization · Data Management and Algorithms · Rough Sets and Fuzzy Logic
Methods: Direct Preference Optimization
