Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Xiaomeng Yang; Mengping Yang; Jia Gong; Luozheng Qin; Zhiyu Tan; Hao Li

arXiv:2502.02088·cs.CV·February 27, 2026

Open Access·3 Reviews

TL;DR

This paper introduces Dual-IPO, an iterative framework that enhances text-to-video generation by jointly optimizing reward and video models, leading to more aligned, high-quality videos without manual annotations.

Contribution

The paper proposes a novel dual-iterative optimization paradigm that improves video synthesis quality and user preference alignment through joint reward and generation model refinement.

Findings

01

Improves video quality across various architectures and sizes.

02

Enables smaller models to outperform larger ones.

03

Systematic ablation confirms effectiveness of each component.

Abstract

Recent advances in video generation have enabled thrilling experiences in producing realistic videos driven by scalable diffusion transformers. However, these models usually fail to produce satisfactory outputs that are aligned with users' authentic demands and preferences. In this work, we introduce Dual-Iterative Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Given this, we optimize video foundation models under the guidance of the reward model's feedback, thus improving synthesis quality in subject consistency, motion smoothness, aesthetic quality, etc. The reward model…
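The voting-based self-consistency and preference certainty steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how certainty is computed, so the sketch assumes it is simply the fraction of sampled judgments that agree with the majority. The function names `vote_preference` and `filter_pairs`, the `"A"`/`"B"` labels, and the `min_certainty` threshold are all hypothetical.

```python
from collections import Counter


def vote_preference(judgments):
    """Aggregate sampled judgments ('A' or 'B') for one video pair by majority vote.

    Returns (winner, certainty). winner is None when the top two labels tie.
    certainty is the top vote fraction: an ASSUMED stand-in for the paper's
    preference certainty estimate, which this page does not define.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    (top, top_n), *rest = counts.most_common()
    certainty = top_n / len(judgments)
    if rest and rest[0][1] == top_n:
        return None, certainty  # tie: no reliable preference
    return top, certainty


def filter_pairs(pairs, min_certainty=0.8):
    """Keep only pairs whose vote is decisive (hypothetical threshold).

    pairs is an iterable of (pair_id, judgments) tuples.
    """
    kept = []
    for pair_id, judgments in pairs:
        winner, certainty = vote_preference(judgments)
        if winner is not None and certainty >= min_certainty:
            kept.append((pair_id, winner, certainty))
    return kept


if __name__ == "__main__":
    sampled = [
        ("p1", ["A", "A", "A", "A", "A"]),  # unanimous
        ("p2", ["A", "B", "A", "B"]),       # tie, dropped
        ("p3", ["B", "B", "A"]),            # weak majority, dropped
    ]
    print(filter_pairs(sampled))
```

Under these assumptions, only pairs with a clear majority would be passed on as pseudo-labels for preference optimization. A certainty score could also be used as a per-sample weight rather than a hard filter.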

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1) The authors introduce a comprehensive and well-motivated dual-iterative framework that jointly optimizes the reward and generation models, addressing limitations of static preference alignment methods.

2) The proposed approach demonstrates strong data efficiency, requiring only a small amount of human-annotated preference data to initiate the self-refinement process.

3) The paper effectively captures the evolving nature of human preferences, emphasizing that fixed offline datasets may lead

Weaknesses

1) The construction of textual prompts used for generating training data is under-specified. The authors mention the use of structured elements (subjects, attributes, spatial relations, and actions), but the paper would benefit from more details regarding the **size, diversity, and balance of the prompt pool**. For instance, how many combinations were used per category, and how does this affect diversity and representativeness?

2) The construction of textual prompts may inadvertently introduce

Reviewer 02·Rating 6·Confidence 4

Strengths

- Dual optimization design – The interplay between a self-refined reward model and iterative generator updates is conceptually elegant and empirically validated.

- Methodological soundness – The CoT-guided pseudo-labeling and PCE-weighted DPO/KTO training are carefully formulated and ablated.

- Strong experimental validation – Includes results across model sizes, architectures, and both automatic and human evaluations, showing consistent improvement.

- Clarity and completeness – Writing, figu

Weaknesses

- High computational cost – Each optimization round involves dual training of large models (VILA-40B, CogVideo-5B), making it unclear how scalable or practical Dual-IPO is for typical research labs.

- Dependence on synthetic preference labels – Despite PCE filtering, the pseudo-label quality may still drift; more human validation or robustness analysis would strengthen the claims.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper provides a motivation for addressing static reward limitations and distribution mismatch in preference alignment for generative video models.

2. The proposed Dual-IPO is a novel optimization paradigm. The idea of jointly optimizing the reward model and generator in a feedback loop is interesting and novel. It addresses reward drift and distribution mismatch issues, which are common in previous methods like DPO or RLHF. It also reduces the requirements on large-scale human annotations

Weaknesses

1. The paper does not analyze the convergence of the dual-iterative process. There is no proof that the process will converge stably to an optimal point rather than oscillating. The paper also does not clearly state when the iterative process should stop.

2. Lack of ablation study. For example, the paper does not contain a complete ablation study of the three key parts of SRPO (CoT, self-consistency, and PCE).

3. The method is complicated and inefficient, which is pointed out by the aut

Taxonomy

Topics·Multimedia Communication and Technology

Methods·ALIGN