Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

TL;DR
This paper introduces diffusion latent beam search with a lookahead estimator for text-to-video generation. The method improves perceptual quality and prompt alignment at inference time, without retraining the model.
Contribution
It proposes an inference-time search technique that improves video quality and alignment by maximizing a calibrated reward, and it reports better efficiency and quality than greedy and best-of-N sampling.
Findings
Improves perceptual quality based on calibrated rewards
Outperforms greedy and best-of-N sampling methods
Requires no additional model training
Abstract
The remarkable progress in text-to-video diffusion models enables the generation of photorealistic videos, although the content of these generated videos often includes unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem has attracted considerable attention: steering the output of diffusion models based on some measure of the content's goodness. Because there is substantial room for improving perceptual quality along the frame direction, we should address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality with respect to alignment to prompts requires reward…
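The search procedure named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser, the reward, the lookahead rule, and every name and parameter below (`denoise_step`, `lookahead_estimate`, `reward`, `TARGET`, `beams`, `candidates`, `lookahead`) are hypothetical stand-ins chosen so the example runs with the Python standard library alone. It shows only the general pattern of sampling several candidate latents per beam at each reverse step, scoring each by a reward computed on a cheap lookahead estimate of the final sample, and keeping the highest-scoring latents.

```python
import random

# Hypothetical target that the toy "alignment reward" prefers.
TARGET = [1.0, -1.0]

def reward(x):
    """Toy reward: negative squared distance to TARGET (higher is better)."""
    return -sum((v - t) ** 2 for v, t in zip(x, TARGET))

def denoise_step(x, sigma, rng):
    """Toy stochastic reverse step: contract the latent and inject noise.
    A stand-in for one sampling step of a real video diffusion model."""
    return [0.9 * v + sigma * rng.gauss(0.0, 1.0) for v in x]

def lookahead_estimate(x, steps):
    """Toy lookahead: apply `steps` noise-free reverse steps to approximate
    where this latent would end up, so the reward is scored on that estimate."""
    est = list(x)
    for _ in range(steps):
        est = [0.9 * v for v in est]
    return est

def latent_beam_search(num_steps=20, beams=2, candidates=4, lookahead=3, seed=0):
    """Keep `beams` latents; at each step expand each into `candidates`
    children and retain the top `beams` by lookahead reward."""
    rng = random.Random(seed)
    latents = [[rng.gauss(0.0, 1.0) for _ in TARGET] for _ in range(beams)]
    for t in range(num_steps, 0, -1):
        sigma = 0.5 * t / num_steps  # noise level shrinks as t approaches 0
        scored = []
        for x in latents:
            for _ in range(candidates):
                child = denoise_step(x, sigma, rng)
                # Never look further ahead than the steps that remain.
                est = lookahead_estimate(child, min(lookahead, t - 1))
                scored.append((reward(est), child))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        latents = [child for _, child in scored[:beams]]
    return max(latents, key=reward)

if __name__ == "__main__":
    best = latent_beam_search()
    print("best latent:", best, "reward:", reward(best))
```

In this sketch, setting `beams=1` reduces the loop to a greedy search that keeps only the single best candidate at each step, while best-of-N sampling would instead run N independent full trajectories and score only the final samples.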
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Music and Audio Processing
Methods: Diffusion
