Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling

Zillur Rahman; Alex Sheng; Cristian Meo

arXiv:2603.01509 · cs.CV · March 3, 2026

TL;DR

This paper introduces 3R, a prompt optimization framework for Text-to-Video generation that improves video quality and coherence without requiring model training, by leveraging retrieval, refinement, and ranking strategies.

Contribution

The paper presents a novel prompt optimization method for T2V models that enhances output quality using retrieval, preference optimization, and temporal interpolation, without additional model training.
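The retrieve, refine, and rank idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the modifier bank, the word-overlap retrieval, the `refine` helper, and the `toy_score` function are all hypothetical. In a real pipeline the score would presumably come from evaluating generated videos rather than from the prompt text.

```python
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def retrieve_modifiers(prompt: str, bank: List[Tuple[str, str]], k: int = 2) -> List[str]:
    """Return the modifiers of the k bank captions sharing the most words with the prompt."""
    query = tokenize(prompt)
    ranked = sorted(bank, key=lambda entry: len(query & tokenize(entry[0])), reverse=True)
    return [modifier for _, modifier in ranked[:k]]


def refine(prompt: str, modifiers: List[str]) -> List[str]:
    """Build candidate prompts: the original, then one per cumulative modifier prefix."""
    candidates = [prompt]
    for i in range(1, len(modifiers) + 1):
        candidates.append(prompt + ", " + ", ".join(modifiers[:i]))
    return candidates


def rank(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: keep the candidate with the highest score."""
    return max(candidates, key=score)


def toy_score(candidate: str) -> float:
    """Hypothetical placeholder score: counts distinct tokens in the prompt.
    A real system would score the generated video, not the prompt string."""
    return float(len(tokenize(candidate)))


# Hypothetical (caption, modifier) bank used only for this example.
BANK = [
    ("a dog running on a beach", "golden hour lighting"),
    ("a city street at night", "neon reflections"),
    ("a dog playing in a park", "slow motion"),
]

PROMPT = "a dog running in a park"
MODIFIERS = retrieve_modifiers(PROMPT, BANK)
CANDIDATES = refine(PROMPT, MODIFIERS)
BEST = rank(CANDIDATES, toy_score)
```

The three functions are kept separate so that each stage can be swapped out independently, for example replacing word overlap with embedding similarity or replacing `toy_score` with a learned video-quality metric.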

Findings

01. Improves static fidelity of generated videos.

02. Enhances dynamic coherence and temporal consistency.

03. Effective across different T2V models without retraining.

Abstract

While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, which risk introducing artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel prompt optimization framework based on retrieval-augmented generation (RAG). 3R draws on current state-of-the-art T2V diffusion models and vision-language models, and it can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization