TL;DR
RAPO++ is a prompt optimization framework for text-to-video (T2V) generation. It improves prompts through training-data-aligned refinement, test-time iterative refinement, and LLM fine-tuning, and reports gains across five T2V models and five benchmarks.
Contribution
It introduces a three-stage prompt optimization approach that unifies data-aligned refinement, test-time scaling, and LLM fine-tuning without altering the generative backbone.
Findings
Achieves significant improvements in semantic alignment and video quality.
Outperforms existing methods on five benchmarks and five T2V models.
Demonstrates the effectiveness of prompt optimization in T2V tasks.
Abstract
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines…
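As a rough illustration of the Stage 1 idea (enriching a user prompt with modifiers retrieved from a relation graph), here is a minimal sketch. It is not the authors' implementation: the graph contents, the word-match lookup, and the names RELATION_GRAPH, retrieve_modifiers, and augment_prompt are all hypothetical, and the refactoring step that matches training distributions is not modeled.

```python
from collections import Counter

# Hypothetical toy relation graph: core term -> modifiers with made-up
# co-occurrence counts. A real graph would be built from training data.
RELATION_GRAPH = {
    "dog": Counter({"running on grass": 5, "golden fur": 3, "wagging tail": 2}),
    "beach": Counter({"at sunset": 4, "gentle waves": 3}),
}


def retrieve_modifiers(prompt, graph, k=2):
    """Return up to k of the highest-count modifiers for each graph term in the prompt."""
    modifiers = []
    for word in prompt.lower().split():
        if word in graph:
            modifiers.extend(m for m, _ in graph[word].most_common(k))
    return modifiers


def augment_prompt(prompt, graph, k=2):
    """Append retrieved modifiers to the prompt; return it unchanged if none match."""
    mods = retrieve_modifiers(prompt, graph, k)
    if not mods:
        return prompt
    return prompt.rstrip(".") + ", " + ", ".join(mods)


if __name__ == "__main__":
    print(augment_prompt("a dog on the beach", RELATION_GRAPH, k=1))
```

Exact word matching stands in for semantic retrieval here only to keep the sketch self-contained; an embedding-based lookup would be the more likely choice in practice.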
