TL;DR
This paper introduces LocalDPO, a post-training framework for aligning text-to-video diffusion models with human preferences by optimizing at the spatio-temporal region level using localized preference pairs.
Contribution
LocalDPO constructs localized preference pairs from real videos, eliminating the need for external critics and manual annotations, and improves video quality and coherence.
Findings
LocalDPO enhances video fidelity and temporal coherence.
It outperforms other post-training methods in human preference scores.
The approach converges rapidly, which the authors attribute to its region-aware loss.
Abstract
Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the…
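The abstract is cut off before it gives the loss or the masking details, so the following is only a minimal sketch of the two ideas it names: a random spatio-temporal mask and a preference loss restricted to the masked region. It assumes a single box-shaped mask and a Diffusion-DPO-style objective computed on masked denoising errors; the function names, the box shape, the `t_frac`/`s_frac` parameters, and `beta` are all hypothetical and not taken from the paper.

```python
import numpy as np


def random_st_mask(shape, rng, t_frac=0.5, s_frac=0.5):
    """Return a (T, H, W) float mask with one random spatio-temporal box set to 1.

    Assumption: a single axis-aligned box; the paper's mask sampler may differ.
    """
    T, H, W = shape
    t_len = max(1, int(T * t_frac))
    h_len = max(1, int(H * s_frac))
    w_len = max(1, int(W * s_frac))
    # rng.integers samples from [low, high), so the box always fits inside the video.
    t0 = rng.integers(0, T - t_len + 1)
    h0 = rng.integers(0, H - h_len + 1)
    w0 = rng.integers(0, W - w_len + 1)
    mask = np.zeros(shape, dtype=np.float32)
    mask[t0:t0 + t_len, h0:h0 + h_len, w0:w0 + w_len] = 1.0
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error averaged over the masked region only."""
    return float((mask * (pred - target) ** 2).sum() / max(mask.sum(), 1.0))


def region_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """Diffusion-DPO-style loss on scalar (masked) denoising errors.

    err_w / err_l: policy error on the preferred (real) / dispreferred (corrupted) sample.
    ref_err_w / ref_err_l: the same errors under the frozen reference model.
    Computes -log(sigmoid(-beta * margin)), which equals log(1 + exp(beta * margin)).
    """
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    return float(np.logaddexp(0.0, beta * margin))


rng = np.random.default_rng(0)
mask = random_st_mask((8, 16, 16), rng)  # 8 frames of 16x16 latents
```

In this sketch the loss falls when the policy's masked error on the real video drops relative to the reference model, or when its masked error on the corrupted video rises, which is the sense in which the supervision is local rather than global.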
