A Principled Loss Function for Direct Language Model Alignment

Yuandong Tan

arXiv:2508.07137 · cs.LG · September 26, 2025

TL;DR

This paper introduces a loss function for language model alignment that is derived from the RLHF optimality condition, targets a finite logits difference instead of maximizing it, and is designed to avoid the training instability and reward hacking that the DPO loss can cause.

Contribution

We propose a novel loss function derived directly from the RLHF optimality condition. It addresses the mismatch between the DPO loss and DPO's own derivation, with the aim of improving training stability and alignment quality.

Findings

01. The method improves fine-tuning stability and alignment.

02. It achieves higher win rates than the DPO baseline.

03. It performs competitively against larger models.

Abstract

The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to…
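The contrast drawn in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: `margin` is an assumed name for the logits difference (the log-ratio of policy to reference on the preferred response minus the same log-ratio on the rejected response), and `beta = 0.1` is an arbitrary choice. `dpo_loss` is the standard DPO objective written as a function of that margin. `target_loss` is a purely hypothetical stand-in (a squared error to a finite target) used only to show what "targeting a finite value" means; the paper's actual loss is not reproduced in this excerpt and may differ.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    # Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)),
    # computed as softplus(-beta * margin) for numerical stability.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_grad(margin: float, beta: float = 0.1) -> float:
    # d(loss)/d(margin) = -beta * sigmoid(-beta * margin).
    # Strictly negative for every finite margin, so gradient descent
    # always pushes the margin higher and never settles.
    return -beta * sigmoid(-beta * margin)


def target_loss(margin: float, target: float) -> float:
    # HYPOTHETICAL stand-in (assumption, not the paper's loss):
    # squared distance between the margin and a finite target value.
    return 0.5 * (margin - target) ** 2


def target_grad(margin: float, target: float) -> float:
    # Gradient of the hypothetical loss; zero exactly at margin == target.
    return margin - target


if __name__ == "__main__":
    for m in (0.0, 1.0, 10.0, 100.0):
        print(f"margin={m:6.1f}  dpo_loss={dpo_loss(m):.6f}  "
              f"dpo_grad={dpo_grad(m):.6e}  "
              f"target_grad={target_grad(m, target=10.0):+.1f}")
```

Under these assumptions, the DPO gradient shrinks as the margin grows but keeps the same sign, so the loss has no finite minimizer in the margin. The stand-in loss has a stationary point at the target and pulls the margin back once it overshoots.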

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling