Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan, Hao Li

TL;DR
This paper introduces Dual-IPO, an iterative framework that improves text-to-video generation by sequentially optimizing both the reward model and the video model, yielding better-aligned, higher-quality videos without large-scale manual annotation.
Contribution
The paper proposes a novel dual-iterative optimization paradigm that improves video synthesis quality and user preference alignment through joint reward and generation model refinement.
Findings
Improves video quality across various architectures and sizes.
Enables smaller models to outperform larger ones.
Systematic ablation confirms effectiveness of each component.
Abstract
Recent advances in video generation have enabled thrilling experiences in producing realistic videos driven by scalable diffusion transformers. However, they usually fail to produce satisfactory outputs that are aligned to users' authentic demands and preferences. In this work, we introduce Dual-Iterative Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Given this, we optimize video foundation models with guidance of signals from reward model's feedback, thus improving the synthesis quality in subject consistency, motion smoothness and aesthetic quality, etc. The reward model…
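The reward-side ideas named in the abstract (voting-based self-consistency and preference certainty estimation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the majority-vote fraction as the certainty score, and the use of that score as a per-pair weight on a standard DPO-style loss are all assumptions made for illustration.

```python
import math
from collections import Counter


def vote_preference(judgments):
    """Majority vote over several sampled judgments (e.g. 'A' or 'B').

    Returns (winner, certainty). Certainty is the fraction of votes for the
    winner; this definition is an assumption, not the paper's formula.
    Ties are resolved in favour of the label seen first.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    winner, top = counts.most_common(1)[0]
    return winner, top / len(judgments)


def weighted_dpo_loss(policy_margin, ref_margin, certainty, beta=0.1):
    """Certainty-weighted DPO-style loss for a single preference pair.

    policy_margin: log p(preferred) - log p(rejected) under the trained model.
    ref_margin:    the same quantity under the frozen reference model.
    certainty:     hypothetical per-pair weight, e.g. from vote_preference.
    """
    z = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    nll = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return certainty * nll


# Four sampled judgments of the same video pair: three prefer A.
winner, certainty = vote_preference(["A", "A", "B", "A"])
loss = weighted_dpo_loss(policy_margin=0.0, ref_margin=0.0, certainty=certainty)
```

In this sketch a pair with split votes contributes less to the loss than a unanimous one, which is one simple way to damp noisy pseudo-labels; a hard certainty threshold would be an equally plausible design.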
Peer Reviews
Decision·ICLR 2026 Poster
1) The authors introduce a comprehensive and well-motivated dual-iterative framework that jointly optimizes the reward and generation models, addressing limitations of static preference alignment methods.
2) The proposed approach demonstrates strong data efficiency, requiring only a small amount of human-annotated preference data to initiate the self-refinement process.
3) The paper effectively captures the evolving nature of human preferences, emphasizing that fixed offline datasets may lead

1) The construction of textual prompts used for generating training data is under-specified. The authors mention the use of structured elements (subjects, attributes, spatial relations, and actions), but the paper would benefit from more details regarding the **size, diversity, and balance of the prompt pool**. For instance, how many combinations were used per category, and how does this affect diversity and representativeness?
2) The construction of textual prompts may inadvertently introduce

- Dual optimization design: The interplay between a self-refined reward model and iterative generator updates is conceptually elegant and empirically validated.
- Methodological soundness: The CoT-guided pseudo-labeling and PCE-weighted DPO/KTO training are carefully formulated and ablated.
- Strong experimental validation: Includes results across model sizes, architectures, and both automatic and human evaluations, showing consistent improvement.
- Clarity and completeness: Writing, figu

- High computational cost: Each optimization round involves dual training of large models (VILA-40B, CogVideo-5B), making it unclear how scalable or practical Dual-IPO is for typical research labs.
- Dependence on synthetic preference labels: Despite PCE filtering, the pseudo-label quality may still drift; more human validation or robustness analysis would strengthen the claims.

1. The paper provides a motivation for addressing static reward limitations and distribution mismatch in preference alignment for generative video models.
2. The proposed Dual-IPO is a novel optimization paradigm. The idea of jointly optimizing the reward model and generator in a feedback loop is interesting and novel. It addresses reward drift and distribution mismatch issues, which are common in previous methods like DPO or RLHF. It also reduces the requirements on large-scale human annotations

1. The paper does not analyze the convergence of the dual-iterative process. There is no proof that the process will converge stably to an optimal point rather than oscillating. The paper also does not clearly state when the iterative process should stop.
2. Lack of ablation study. For example, the paper does not contain a complete ablation study of the three key parts of SRPO (CoT, self-consistency, and PCE).
3. The method is complicated and inefficient, which is pointed out by the aut
Taxonomy
Topics·Multimedia Communication and Technology
Methods·ALIGN
