Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation
Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Tianxiang Zheng, Qinhlin Lu

TL;DR
This paper identifies and addresses likelihood displacement in diffusion models during preference optimization, proposing PG-DPO with ARS and IPR to enhance video generation quality and alignment.
Contribution
It provides a formal analysis of likelihood displacement in diffusion models and introduces PG-DPO, a novel method to mitigate this issue in preference-based training.
Findings
PG-DPO outperforms existing methods in quantitative metrics.
The proposed approach improves qualitative video generation results.
Analysis reveals two main failure modes in likelihood displacement.
Abstract
Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises…
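To make the likelihood displacement described in the abstract concrete, the sketch below evaluates the standard DPO loss on scalar log-likelihoods. It is a minimal illustration, not the paper's PG-DPO, ARS, or IPR. The function name `dpo_loss`, the value `beta=0.1`, and all numeric log-likelihoods are hypothetical and chosen only for the example. In a diffusion model the log-likelihood terms would typically be replaced by a denoising-error proxy, which this sketch does not attempt.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative, scalar inputs).

    The margin is the policy-vs-reference log-ratio of the chosen sample
    minus that of the rejected sample.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical numbers: the policy starts equal to the reference model.
ref_chosen, ref_rejected = -10.0, -10.0
before = dpo_loss(-10.0, -10.0, ref_chosen, ref_rejected)

# After some updates BOTH log-likelihoods have dropped, but the rejected
# sample dropped further, so the margin grew and the loss went down.
after = dpo_loss(-12.0, -20.0, ref_chosen, ref_rejected)

print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

In this toy case the loss improves even though the chosen sample's log-likelihood fell from -10 to -12, because the objective depends only on the margin between the two samples. That is the behavior the abstract calls likelihood displacement.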
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The paper provides a novel formal decomposition of DPO's updating dynamics in diffusion models, revealing actionable failure modes. 2. The proposed approach, PG-DPO, effectively combines ARS and IPR to address both small- and large-margin issues, with empirical evidence (e.g., Fig. 2) showing consistent probability increases for chosen samples. 3. The framework also extends to other fine-tuning algorithms (e.g., SFT, KTO) and high-dimensional tasks like video generation.
1. Hyperparameters (e.g., K1, K2 in ARS/IPR) introduce tuning complexity without clear ablation studies. 2. Experimental details (e.g., datasets, baselines, quantitative metrics) are referenced but not fully provided in the visible pages, making it hard to assess reproducibility or superiority claims. For example, SFT has been shown to be the most effective approach to post-training, yet the paper does not describe its pipeline for selecting post-training data. Is DPO applied only to the pretrained model, or to an SFT model as well?
- Modify DPO objective based on identified failure modes.
- The modification to DPO is considered incremental.
- Insufficient experiments:
  - The proposed method is not validated on image generation.
  - The proposed method is compared only with VideoDPO in the experiments.
**Clarity:** 1. The paper is well-structured: problem → analysis → solution → experiments. 2. Visualizations (likelihood trajectories, qualitative generations) clearly support the claims.
**Hyperparameter sensitivity:** PG-DPO introduces new hyperparameters. The paper admits these require careful tuning, but no adaptive scheme or systematic guideline is provided. An ablation study is encouraged. **Generality claims:** The framework is said to be extensible to various fine-tuning methods, but no empirical validation outside DPO is provided.
Taxonomy
Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Advanced Multi-Objective Optimization Algorithms
