What Is Preference Optimization Doing, and Why?

Yue Wang; Qizhou Wang; Zizhuo Zhang; Gang Niu; Bo Han; Masashi Sugiyama

arXiv:2512.00778·cs.LG·May 18, 2026

TL;DR

This paper analyzes the optimization dynamics of preference optimization methods such as DPO and PPO in large language models, showing how their behaviors differ and what role each key component plays.

Contribution

It provides a detailed analysis of the underlying causes of the differences between DPO and PPO, offering new insights into their optimization dynamics and the roles of their components.

Findings

01. DPO follows stable target directions, while PPO balances exploration and exploitation.

02. Loss reweighting in DPO acts as a regularizer, not a reward signal.

03. Negative learning in PPO primarily supports exploration.
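
To make the PPO finding concrete, the sketch below evaluates the standard PPO clipped surrogate and splits it by the sign of the advantage. It is a minimal illustration, not code from the paper: the function names, the clipping range `eps=0.2`, and the reading of "negative learning" as the samples with non-positive advantage are all assumptions made for this example.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped objective for one sample (a quantity to maximize).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def split_by_advantage_sign(ratios, advantages, eps: float = 0.2):
    """Sum the surrogate separately over positive and non-positive advantages.

    Assumption for this sketch: the positive part stands in for "positive
    learning" and the non-positive part for "negative learning".
    """
    positive_part = 0.0
    negative_part = 0.0
    for ratio, advantage in zip(ratios, advantages):
        value = clipped_surrogate(ratio, advantage, eps)
        if advantage > 0:
            positive_part += value
        else:
            negative_part += value
    return positive_part, negative_part
```

With a positive advantage the objective stops rewarding increases of the ratio beyond `1 + eps`; with a negative advantage it stops rewarding decreases below `1 - eps`. That asymmetry is why the two parts are worth inspecting separately.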

Abstract

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and explaining their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, which validates the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, three key yet seldom discussed components of PO methods. Our…
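
The three components named in the abstract can be read off the standard DPO loss for a single preference pair. The sketch below uses the textbook DPO formulation rather than code from the paper; the function name `dpo_pair_terms`, its argument names, and the value `beta=0.1` are assumptions made for this example, and the paper's own definitions may differ in detail.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_pair_terms(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Loss and gradient coefficients for one (chosen, rejected) pair.

    logp_* are sequence log-probabilities under the policy, ref_logp_* under
    the frozen reference model. All names here are hypothetical.
    """
    # Implicit rewards from the standard DPO derivation.
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    margin = r_w - r_l

    loss = -math.log(sigmoid(margin))
    # Reweighting factor: close to 1 when the pair is ranked wrongly,
    # close to 0 when it is already ranked correctly by a wide margin.
    omega = sigmoid(-margin)

    return {
        "loss": loss,
        "omega": omega,
        # d loss / d logp_w is negative: descent raises the chosen log-prob
        # (positive learning).
        "grad_wrt_logp_w": -beta * omega,
        # d loss / d logp_l is positive: descent lowers the rejected log-prob
        # (negative learning).
        "grad_wrt_logp_l": beta * omega,
    }
```

When the policy equals the reference model the margin is zero, so `omega` is 0.5 and the loss is `log 2`. As the margin grows, `omega` shrinks and both the positive and negative updates are damped by the same factor.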

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- Well-structured decomposition of PO into positive/negative learning and reweighting, which connects intuitively to training heuristics.
- The *gradient alignment* tool is simple and allows concrete insights.
- Evaluates variants of PO (cDPO, cPPO, hPPO) derived from the insights of the analysis.
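
This page does not reproduce the paper's definition of gradient alignment. As a generic stand-in only, one common way to compare an update direction with a reference direction is cosine similarity between the two flattened gradient vectors. The function name and the choice of cosine similarity below are assumptions for illustration and should not be read as the authors' metric.

```python
import math


def gradient_alignment(update, reference):
    """Cosine similarity between two flattened gradient vectors.

    Hypothetical proxy: +1 means the same direction, 0 means orthogonal,
    -1 means opposite. Returns 0.0 if either vector has zero norm.
    """
    dot = sum(u * r for u, r in zip(update, reference))
    norm_update = math.sqrt(sum(u * u for u in update))
    norm_reference = math.sqrt(sum(r * r for r in reference))
    if norm_update == 0.0 or norm_reference == 0.0:
        return 0.0
    return dot / (norm_update * norm_reference)
```

Tracking such a score over training steps would show whether updates keep pointing toward a fixed target or drift, which is the kind of question the reviews attribute to the paper's analysis.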

Weaknesses

- Insufficient breadth and scale of experiments
  - The paper uses a single base model (Pythia-2.8b) and narrow task sets. The claims in the paper about "what PO is doing" should be tested on larger models, multiple families, and varied domains.
  - Although the motivation of the paper seems promising, empirical proof of PO tendency should be backed up with much more depth.
- Lack of theoretical framing
  - Tightening the theoretical relation between $G$ and the performance can strengthen

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper provides a deep, mechanistic explanation for the oft-discussed differences between DPO and PPO by skillfully analyzing their respective training dynamics.
2. The introduction of the 'gradient alignment' metric is a notable contribution, offering an effective method to quantify and inspect the optimization dynamics of preference alignment algorithms.
3. The findings are clear and insightful, providing actionable explanations for the distinct roles of positive learning, negative learn

Weaknesses

1. The paper provides extensive empirical analysis, but it lacks a rigorous theoretical foundation to formally explain the underlying reasons for the observed phenomena.
2. The analysis could be strengthened by incorporating the distribution of key data properties. For instance, analyzing the distributions of the DPO reweighting term ($\omega$) and the PPO absolute advantage ($|\hat A|$), both globally and within subgroups, would provide a more complete picture of their impact.
3. The 'gradient

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The components of preference learning, *e.g.*, positive and negative learning and loss re-weighting, are analyzed thoroughly.
* Ablation study strengthens the persuasiveness of the conclusions and provides insights for future research.

Weaknesses

I deem that several logical flaws hinder the soundness of the conclusions, so I lean toward rejecting the paper. I would like to raise my score if these concerns are well addressed.
* L105: Why does the distinction between SFT and RL lie in whether they have relatively stable targets? I deem the difference between SFT and RL lies in whether they learn from demonstrations or rewards.
* L134 (Minor): I do not think the objective is inherently non-differentiable.
* L143: It is not very clear to me why t

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)