A Principled Loss Function for Direct Language Model Alignment
Yuandong Tan

TL;DR
This paper introduces a new loss function for language model alignment that is theoretically sound, stable, and prevents reward hacking, improving fine-tuning outcomes compared to existing methods.
Contribution
We propose a novel loss function derived from RLHF optimality, addressing DPO's theoretical misalignment and enhancing model stability and alignment quality.
Findings
Our method improves fine-tuning stability and alignment quality.
It achieves higher win rates than the DPO baseline.
It performs competitively against larger models.
Abstract
The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to…
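The contrast drawn in the abstract can be illustrated with a small numerical sketch. It relies on two standard facts: the DPO loss is -log sigmoid(beta * margin), where the margin is the chosen-minus-rejected difference of policy-to-reference log-ratios, and the KL-regularized RLHF optimum pi*(y|x) ∝ pi_ref(y|x) exp(r(x, y) / beta) implies a finite optimal margin of (r_w - r_l) / beta. The squared-error form of `target_loss` below is an assumption chosen only for illustration, not the loss proposed in the paper, and the function names, `reward_gap`, and the default `beta` are hypothetical.

```python
import math

def dpo_loss(margin, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    z = beta * margin
    # Numerically stable softplus(-z).
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

def dpo_grad(margin, beta=0.1):
    """d(dpo_loss)/d(margin) = -beta * sigmoid(-beta * margin)."""
    z = beta * margin
    if z >= 0:
        e = math.exp(-z)
        return -beta * e / (1.0 + e)
    return -beta / (1.0 + math.exp(z))

def target_loss(margin, reward_gap, beta=0.1):
    """Illustrative assumption: squared distance to the finite optimal margin."""
    target = reward_gap / beta  # margin implied by the RLHF optimality condition
    return 0.5 * (margin - target) ** 2

def target_grad(margin, reward_gap, beta=0.1):
    """d(target_loss)/d(margin); zero exactly at the target margin."""
    return margin - reward_gap / beta

if __name__ == "__main__":
    reward_gap = 0.5  # hypothetical reward difference r_w - r_l
    for m in (0.0, 5.0, 50.0):
        print(m, dpo_grad(m), target_grad(m, reward_gap))
```

For finite margins the DPO gradient stays negative, so gradient descent keeps enlarging the margin, whereas the target-style gradient changes sign at `reward_gap / beta` and pulls the margin back toward that finite value.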
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
