GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen

TL;DR
GPT4Motion integrates large language models, Blender physics, and diffusion-based image generation to produce coherent, high-quality videos from text prompts without additional training, addressing computational costs and motion consistency issues.
Contribution
It introduces a training-free framework that plans physical motions via GPT-4 and Blender, improving text-to-video synthesis quality and coherence.
Findings
Effective in scenarios like object drop, collision, cloth draping, swinging, and liquid flow.
Produces videos with high motion coherence and entity consistency.
Operates efficiently without additional training.
Abstract
Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion…
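The planning-then-simulation idea in the abstract can be illustrated with a minimal sketch. It does not call GPT-4, Blender, or Stable Diffusion: `build_planner_prompt`, `DropScene`, and `simulate_drop` are hypothetical names introduced here, and a simple semi-implicit Euler integrator stands in for Blender's physics engine. It only shows how a text prompt could be turned into a planning request and how a physics step yields per-frame object states that stay consistent across frames.

```python
from dataclasses import dataclass
from typing import List

GRAVITY = -9.81  # m/s^2 along the vertical axis (assumed units: metres, seconds)


@dataclass
class DropScene:
    """Hypothetical parameters a planner might extract from a text prompt."""
    object_name: str
    start_height: float  # metres above the floor
    fps: int = 24
    num_frames: int = 48


def build_planner_prompt(user_prompt: str) -> str:
    """Assemble an instruction asking an LLM for a Blender script.

    The wording is illustrative only, not the prompt used by the authors.
    """
    return (
        "Write a Blender Python script that sets up a physics simulation "
        "for the following scene and renders one image per frame.\n"
        f"Scene description: {user_prompt}"
    )


def simulate_drop(scene: DropScene) -> List[float]:
    """Return the object's height at every frame.

    Semi-implicit Euler integration, used here as a stand-in for a real
    physics engine. The object stops when it reaches the floor (no bounce).
    """
    dt = 1.0 / scene.fps
    height, velocity = scene.start_height, 0.0
    heights = []
    for _ in range(scene.num_frames):
        heights.append(height)
        velocity += GRAVITY * dt
        height = max(0.0, height + velocity * dt)
        if height == 0.0:
            velocity = 0.0  # resting on the floor
    return heights


if __name__ == "__main__":
    prompt = "A basketball free falls in the air"
    print(build_planner_prompt(prompt))
    scene = DropScene(object_name="basketball", start_height=2.0)
    for frame, h in enumerate(simulate_drop(scene)[:5]):
        print(f"frame {frame}: height = {h:.3f} m")
```

Because every frame is derived from the same simulated state, the per-frame conditions change smoothly, which is the property the framework relies on for coherent motion.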
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Discriminative Fine-Tuning · Attention Dropout · Weight Decay · Cosine Annealing · Residual Connection · Position-Wise Feed-Forward Layer
