Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

Yuta Oshima; Masahiro Suzuki; Yutaka Matsuo; Hiroki Furuta

arXiv:2501.19252·cs.CV·October 8, 2025

TL;DR

This paper introduces diffusion latent beam search with a lookahead estimator for text-to-video generation, improving perceptual quality and prompt alignment at inference time without retraining the model.

Contribution

The paper proposes an inference-time search technique that improves video quality and prompt alignment by optimizing a calibrated reward, outperforming greedy and best-of-N sampling in both efficiency and quality.

Findings

01 Improves perceptual quality based on calibrated rewards

02 Outperforms greedy and best-of-N sampling methods

03 Requires no additional model training

Abstract

The remarkable progress in text-to-video diffusion models enables the generation of photorealistic videos, although the generated videos often contain unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem, in which the output of diffusion models is steered based on some measure of the content's quality, has attracted considerable attention. Because there is substantial room to improve perceptual quality along the frame direction, we should address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality with respect to alignment to prompts requires reward…
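The search idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser, the lookahead estimator, and the reward below (`toy_denoise_step`, `toy_lookahead`, `toy_reward`) are hypothetical stand-ins on small vectors, and the loop structure is only one plausible reading of "select a better diffusion latent to maximize a given alignment reward". At each reverse step, every kept latent spawns several noisy candidates, each candidate is scored on a cheap estimate of its final sample, and only the top-scoring latents survive.

```python
import random


def toy_denoise_step(latent, t, rng):
    """Hypothetical stand-in for one stochastic reverse-diffusion step."""
    noise_scale = 0.1 * t  # less noise is injected as t approaches 0
    return [0.9 * x + noise_scale * rng.gauss(0.0, 1.0) for x in latent]


def toy_lookahead(latent, steps_left):
    """Hypothetical lookahead: denoise deterministically (no noise) for the
    remaining steps to get a cheap estimate of the final sample."""
    estimate = list(latent)
    for _ in range(steps_left):
        estimate = [0.9 * x for x in estimate]
    return estimate


def toy_reward(sample, target=0.5):
    """Hypothetical reward: higher (closer to 0) when values are near target."""
    return -sum((x - target) ** 2 for x in sample)


def latent_beam_search(num_steps=10, beam_size=2, num_candidates=4, dim=8, seed=0):
    """Keep the `beam_size` latents whose lookahead estimates score best."""
    rng = random.Random(seed)
    beams = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(beam_size)]
    for t in range(num_steps, 0, -1):
        candidates = []
        for latent in beams:
            for _ in range(num_candidates):
                nxt = toy_denoise_step(latent, t, rng)
                # Score the candidate on its estimated final sample,
                # not on the still-noisy latent itself.
                score = toy_reward(toy_lookahead(nxt, t - 1))
                candidates.append((score, nxt))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [lat for _, lat in candidates[:beam_size]]
    best = max(beams, key=toy_reward)
    return best, toy_reward(best)


if __name__ == "__main__":
    best, score = latent_beam_search()
    print(f"final reward: {score:.4f}")
```

In this sketch, setting `beam_size=1` reduces the loop to a greedy search over candidates at each step, while setting `num_candidates=1` amounts to running independent samples with no selection between steps. A real text-to-video setup would replace the toy functions with a diffusion sampler and a learned reward model.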

Code & Models

Repositories

shim0114/T2V-Diffusion-Search
pytorch

Videos

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search · slideslive

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Music and Audio Processing

Methods: Diffusion