Minor DPO reject penalty to increase training robustness

Shiming Xie; Hong Chen; Fred Yu; Zeye Sun; Xiuyu Wu; Yingfan Hu

arXiv:2408.09834·cs.AI·September 2, 2024

TL;DR

This paper analyzes the Direct Preference Optimization (DPO) method for LLM fine-tuning, identifies its limitations, and proposes MinorDPO to improve alignment and training stability by adjusting the penalty term.

Contribution

The paper introduces MinorDPO, a modification to DPO that enhances alignment with RL algorithms and increases training robustness in preference-based LLM fine-tuning.

Findings

1. MinorDPO improves training stability.

2. MinorDPO better aligns with RL algorithms.

3. Analysis of DPO's $\beta$ parameter reveals its impact.

Abstract

Learning from human preference is a paradigm used in the fine-tuning step of large language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. Previously, reinforcement learning from human feedback (RLHF) algorithms were used to optimize the LLM policy to align with these preferences without drifting too far from the original model. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method. Using preference pairs of chosen and rejected data, DPO models the relative log probability as an implicit reward function and optimizes the LLM policy directly with a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of $\beta$ in DPO, disclose its syntax difference…
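The objective the abstract describes can be illustrated with a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper. The sketch covers only the original DPO objective; it does not implement MinorDPO, whose modified penalty is not spelled out in this excerpt.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities: two from the policy
    being trained and two from the frozen reference model.
    The names and the default beta are hypothetical, for illustration only.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Binary cross-entropy on the reward margin: -log(sigmoid(margin)).
    margin = reward_chosen - reward_rejected

    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Example: the policy has raised the chosen answer's log-probability
# relative to the reference, so the loss falls below log(2).
loss = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

When the policy equals the reference model, the margin is zero and the loss is log 2 (about 0.693) regardless of `beta`. A larger `beta` makes the loss more sensitive to any deviation of the policy from the reference.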

Taxonomy

Topics: Cardiac Valve Diseases and Treatments

Methods: Direct Preference Optimization · ALIGN