Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang

TL;DR
This paper explores using AI-driven perceptual feedback from vision-language models to improve the realism of dynamic object interactions in text-to-video generation, targeting unrealistic movements and violations of real-world physics.
Contribution
It introduces a method that uses feedback from vision-language models to improve object dynamics, and shows that this feedback is a more effective signal than traditional video metrics.
Findings
AI feedback significantly improves interaction scene quality.
Vision-language model signals outperform traditional video metrics.
Gains are substantial in complex multi-object interactions and falling-object scenarios.
Abstract
Large text-to-video models hold immense potential for a wide range of downstream applications. However, they struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively reduce movement misalignment and yield realistic object interactions? We first point out that offline RL-finetuning algorithms for text-to-video models turn out to be equivalent when derived from a unified probabilistic objective. This perspective highlights that there is no…
