Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM

Jaemin Kim; Bryan Sangwoo Kim; Jong Chul Ye

arXiv:2411.17041·cs.CV·October 21, 2025

Open Access

TL;DR

Free$^2$Guide introduces a training-free, gradient-free framework that leverages large vision-language models to improve text-to-video alignment in diffusion-based video synthesis, without requiring differentiable reward functions.

Contribution

It uses principles from path integral control to approximate diffusion guidance from non-differentiable rewards, which lets black-box LVLMs serve as reward models for text-video alignment without any training or differentiable reward function.
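The idea of steering a sampler with a reward that cannot be differentiated can be illustrated with a toy sketch. This is not the paper's algorithm: `toy_denoise_step` and `black_box_reward` are hypothetical stand-ins for a reverse-diffusion step and an LVLM query, and the exponential reward weighting is only the generic form commonly associated with path integral control. At each step, several candidate next states are drawn, scored by the black-box reward, and one is resampled in proportion to `exp(reward / temperature)`, so no gradient of the reward is ever needed.

```python
import math
import random


def softmax_weights(rewards, temperature):
    """Weights proportional to exp(r_k / temperature), normalized to sum to 1."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def toy_denoise_step(x, t, rng):
    """Hypothetical stand-in for one stochastic reverse-diffusion step.

    The step index t is unused in this toy; a real sampler would depend on it.
    """
    return [0.9 * xi + 0.3 * rng.gauss(0.0, 1.0) for xi in x]


def black_box_reward(x):
    """Hypothetical non-differentiable scorer (stand-in for an LVLM query).

    Rounding makes the score piecewise constant, so gradients are useless.
    """
    return -sum(abs(round(xi, 1) - 1.0) for xi in x)


def guided_sample(x0, num_steps, num_candidates, temperature, rng):
    """Gradient-free guidance: propose, score, and resample at every step."""
    x = list(x0)
    for t in range(num_steps, 0, -1):
        candidates = [toy_denoise_step(x, t, rng) for _ in range(num_candidates)]
        rewards = [black_box_reward(c) for c in candidates]
        weights = softmax_weights(rewards, temperature)
        # Draw one candidate; higher-reward candidates are more likely.
        x = rng.choices(candidates, weights=weights, k=1)[0]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    sample = guided_sample([0.0] * 4, num_steps=20, num_candidates=16,
                           temperature=0.05, rng=rng)
    print(black_box_reward(sample))
```

Setting `num_candidates=1` reduces the loop to the unguided toy sampler, which makes it easy to compare rewards with and without guidance. A lower `temperature` pushes the selection toward always taking the best-scoring candidate.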

Findings

01 Significantly improves text-to-video alignment quality.

02 Supports ensembling of multiple reward models.

03 Operates with minimal computational overhead.
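As a rough illustration of what ensembling reward models can look like, the sketch below combines several black-box scorers into one score per candidate. The combination rule (min-max normalization across candidates followed by a weighted sum) and the names `normalize` and `ensemble_scores` are assumptions made for this example; the summary above does not state which rule the paper uses.

```python
def normalize(scores):
    """Min-max normalize scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(candidates, reward_fns, weights):
    """Weighted sum of per-model normalized rewards (an assumed rule)."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    combined = [0.0] * len(candidates)
    for fn, w in zip(reward_fns, weights):
        # Normalizing per model keeps differently scaled scorers comparable.
        normed = normalize([fn(c) for c in candidates])
        for i, s in enumerate(normed):
            combined[i] += w * s
    return combined


if __name__ == "__main__":
    # Toy scorers on strings stand in for real reward models.
    candidates = ["a", "bb", "ccc"]
    scorers = [len, lambda c: c.count("b")]
    print(ensemble_scores(candidates, scorers, [0.5, 0.5]))
```

Because each scorer is only called on candidates and never differentiated, any mix of black-box models can be plugged in, and the weights control how much each one influences the final ranking.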

Abstract

Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for videos, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Vision and Imaging

Methods: Diffusion