Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement
Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

TL;DR
This paper introduces a training-free, iterative self-refinement framework that uses large language models and vision-language models to improve the physical realism of generated videos, raising the Physics-IQ score from 56.31 to 62.38 on the PhyIQ benchmark.
Contribution
It presents a physics-aware, plug-and-play refinement method for video generation that draws its guidance from large language models and vision-language models and requires no additional training.
Findings
Improves the Physics-IQ score from 56.31 to 62.38 on the PhyIQ benchmark.
Introduces a multimodal chain-of-thought process for physics-guided refinement.
Demonstrates effectiveness across various video generation models.
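To put the first finding in perspective, the reported change can be restated as an absolute and a relative gain. The snippet below only does arithmetic on the two scores quoted above; the variable names are illustrative and nothing here comes from the authors' code.

```python
# Scores as reported for the PhyIQ benchmark.
baseline_score = 56.31
refined_score = 62.38

# Absolute gain in Physics-IQ points.
absolute_gain = round(refined_score - baseline_score, 2)

# Relative gain with respect to the baseline, in percent.
relative_gain_pct = round(100 * absolute_gain / baseline_score, 2)

print(f"absolute gain: {absolute_gain:.2f} points")   # absolute gain: 6.07 points
print(f"relative gain: {relative_gain_pct:.2f} %")    # relative gain: 10.78 %
```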
Abstract
Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for…
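The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `refine_video`, the three callables (`generate`, `critique`, `rewrite`), the `max_rounds` parameter, and the stopping rule are all assumptions, and the assignment of critique to a vision-language model and rewriting to a language model is one plausible reading of the abstract. The stub functions stand in for real models so the control flow can be run as is.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   generate(prompt)          -> video            (any video generation model)
#   critique(video, prompt)   -> list of issues   (e.g. a vision-language model)
#   rewrite(prompt, issues)   -> refined prompt   (e.g. a large language model)
Generate = Callable[[str], str]
Critique = Callable[[str, str], List[str]]
Rewrite = Callable[[str, List[str]], str]


def refine_video(
    prompt: str,
    generate: Generate,
    critique: Critique,
    rewrite: Rewrite,
    max_rounds: int = 3,  # assumed budget; the abstract gives no number
) -> Tuple[str, str, List[Tuple[str, List[str]]]]:
    """Iteratively regenerate a video from a prompt refined by physics feedback."""
    history: List[Tuple[str, List[str]]] = []
    video = generate(prompt)
    for _ in range(max_rounds):
        issues = critique(video, prompt)
        history.append((prompt, issues))
        if not issues:  # assumed stopping rule: no inconsistencies reported
            break
        prompt = rewrite(prompt, issues)
        video = generate(prompt)
    return video, prompt, history


# Stub models so the loop can be exercised without any real generator.
def stub_generate(prompt: str) -> str:
    return f"<video for: {prompt}>"


def stub_critique(video: str, prompt: str) -> List[str]:
    # Pretend the critic complains until the prompt mentions gravity.
    return [] if "gravity" in prompt else ["ball does not fall"]


def stub_rewrite(prompt: str, issues: List[str]) -> str:
    return prompt + " Objects fall under gravity."


final_video, final_prompt, history = refine_video(
    "A ball rolls off a table.", stub_generate, stub_critique, stub_rewrite
)
print(final_prompt)
print(len(history), "critique rounds")
```

Because the loop only touches the prompt and treats the generator as a black box, no model weights are updated, which is consistent with the training-free, plug-and-play claim in the abstract.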
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
