Minor DPO reject penalty to increase training robustness
Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu, Yingfan Hu

TL;DR
This paper analyzes the Direct Preference Optimization (DPO) method for LLM fine-tuning, identifies its limitations, and proposes MinorDPO to improve alignment and training stability by adjusting the penalty term.
Contribution
The paper introduces MinorDPO, a modification to DPO that enhances alignment with RL algorithms and increases training robustness in preference-based LLM fine-tuning.
Findings
MinorDPO improves training stability.
MinorDPO better aligns with RL algorithms.
Analysis of DPO's $\beta$ parameter reveals its impact.
Abstract
Learning from human preference is a paradigm used in the fine-tuning step of large-scale language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. Previously, reinforcement learning from human feedback (RLHF) was used to optimize the LLM policy to align with these preferences without drifting too far from the original model. Recently, Direct Preference Optimization (DPO) was proposed to solve the alignment problem with a simplified, RL-free method. Using preference pairs of chosen and rejected data, DPO models the relative log probability as an implicit reward function and optimizes the LLM policy directly with a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of $\beta$ in DPO, disclose its syntax difference…
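The objective described in the abstract can be sketched for a single preference pair as follows. This is a minimal illustration of the standard DPO loss, not the authors' code and not the MinorDPO variant; the function names (`softplus`, `dpo_loss`) and the default `beta=0.1` are assumptions made for this example, and the inputs are assumed to be summed sequence log probabilities under the policy and the frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is log pi(y | x) summed over the response tokens.
    beta is an illustrative default, not a value taken from the paper.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Binary cross-entropy on the reward margin:
    # -log(sigmoid(margin)) == softplus(-margin).
    margin = chosen_reward - rejected_reward
    loss = softplus(-margin)
    return loss, chosen_reward, rejected_reward


if __name__ == "__main__":
    # Hypothetical log probabilities, chosen only for illustration.
    loss, r_chosen, r_rejected = dpo_loss(-12.0, -15.0, -13.0, -14.0)
    print(f"loss={loss:.4f} chosen_reward={r_chosen:.3f} "
          f"rejected_reward={r_rejected:.3f}")
```

When the policy equals the reference model, both implicit rewards are zero and the loss is log 2. Note that the margin grows either by raising the chosen log-ratio or by lowering the rejected one, and $\beta$ scales how strongly each log-ratio contributes.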
Taxonomy
Topics: Cardiac Valve Diseases and Treatments
Methods: Direct Preference Optimization · ALIGN
