DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Ziyi Wu; Anil Kag; Ivan Skorokhodov; Willi Menapace; Ashkan Mirzaei; Igor Gilitschenski; Sergey Tulyakov; Aliaksandr Siarohin

arXiv:2506.03517 · cs.CV · October 13, 2025

TL;DR

DenseDPO enhances text-to-video diffusion models by enabling fine-grained, motion-neutral preference learning through aligned video pairs and segment-level annotations, significantly improving motion generation quality.

Contribution

The paper introduces DenseDPO, a method that builds aligned video pairs and labels preferences per segment rather than per clip, reducing both the amount of preference data needed and the annotation bias in preference optimization for video diffusion models.

Findings

01. DenseDPO improves motion generation over vanilla DPO.

02. DenseDPO matches vanilla DPO in text alignment and visual quality.

03. Automatic preference annotation with VLMs is effective.

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a…
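The two ideas named in the abstract, building aligned pairs from corrupted copies of one ground-truth video and labeling preferences per segment, can be illustrated with a rough NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names (`corrupt_copies`, `segment_means`, `dense_dpo_loss`) are hypothetical, the linear-interpolation noising is one possible corruption scheme, and the loss follows the general shape of a diffusion DPO objective applied per segment. The abstract is truncated and does not state the exact formula.

```python
import numpy as np


def corrupt_copies(video, noise_level, rng):
    """Return two independently noised copies of one ground-truth video.

    Assumption: linear-interpolation noising x_t = (1 - t) * x_0 + t * eps.
    A (hypothetical) video diffusion model would then denoise each copy,
    giving two videos that share motion structure but differ in details.
    """
    copies = []
    for _ in range(2):
        eps = rng.standard_normal(video.shape)
        copies.append((1.0 - noise_level) * video + noise_level * eps)
    return copies[0], copies[1]


def segment_means(per_frame, segment_len):
    """Average a per-frame quantity over consecutive segments of frames."""
    n_seg = len(per_frame) // segment_len
    trimmed = np.asarray(per_frame[: n_seg * segment_len], dtype=float)
    return trimmed.reshape(n_seg, segment_len).mean(axis=1)


def dense_dpo_loss(err_a, err_b, ref_err_a, ref_err_b, labels,
                   segment_len, beta=1.0):
    """Segment-level DPO-style loss (illustrative form, not from the paper).

    err_*     : per-frame denoising errors of the trained model on videos A, B
    ref_err_* : the same errors under the frozen reference model
    labels    : one entry per segment, +1 if A is preferred, -1 if B is
                preferred, 0 for a tie (ties are ignored)
    """
    gap_a = segment_means(err_a, segment_len) - segment_means(ref_err_a, segment_len)
    gap_b = segment_means(err_b, segment_len) - segment_means(ref_err_b, segment_len)
    labels = np.asarray(labels, dtype=float)
    # Positive logit when the preferred segment improved more than the other.
    logits = -beta * labels * (gap_a - gap_b)
    mask = labels != 0
    if not mask.any():
        return 0.0
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably.
    losses = np.logaddexp(0.0, -logits[mask])
    return float(losses.mean())
```

In this sketch a clip in which A wins some segments and B wins others still contributes a training signal for every labeled segment, whereas a single clip-level label would have to pick one winner for the whole video.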

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization