Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Hiroki Furuta; Heiga Zen; Dale Schuurmans; Aleksandra Faust; Yutaka Matsuo; Percy Liang; Sherry Yang

arXiv:2412.02617·cs.LG·April 21, 2026

TL;DR

This paper explores using perceptual feedback from vision-language models to make dynamic object interactions in text-to-video generation more realistic, targeting inaccurate movements and physics violations.

Contribution

It introduces a method that uses vision-language models as a feedback source to improve object dynamics, and reports that this feedback yields larger improvements than traditional video metrics.

Findings

01

AI feedback significantly improves interaction scene quality.

02

Vision-language model signals outperform traditional video metrics.

03

Substantial gains in complex multi-object interactions and falling-object scenarios.

Abstract

Large text-to-video models hold immense potential for a wide range of downstream applications. However, they struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively overcome movement misalignment and produce realistic object interactions? We first point out that offline RL-finetuning algorithms for text-to-video models can be seen as equivalent, since they derive from a unified probabilistic objective. This perspective highlights that there is no…
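
To make the idea of offline finetuning with feedback concrete, here is a minimal sketch of one well-known scheme, reward-weighted regression, in which each generated sample's log-likelihood is weighted by an exponentiated feedback score. The abstract excerpt above does not say which algorithms the paper uses, so the choice of this scheme is an assumption for illustration. The function names (`reward_weights`, `weighted_nll`), the temperature `beta`, and the scalar scores are likewise hypothetical.

```python
import math


def reward_weights(rewards, beta=1.0):
    """Map scalar feedback scores to normalized weights proportional to exp(r / beta).

    `rewards` stands in for per-video scores (for example from a
    vision-language model); this is an illustrative assumption.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(rewards)
    exps = [math.exp((r - top) / beta) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(log_probs, rewards, beta=1.0):
    """Reward-weighted negative log-likelihood over a batch of samples.

    `log_probs` are the model's log-likelihoods of its own generated
    samples; samples with higher feedback contribute more to the loss.
    """
    weights = reward_weights(rewards, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

A smaller `beta` concentrates the weight on the highest-scoring samples, while a larger `beta` moves the loss toward plain likelihood training on all samples equally.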
