GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Jiaxi Lv; Yi Huang; Mingfu Yan; Jiancheng Huang; Jianzhuang Liu; Yifan Liu; Yafei Wen; Xiaoxin Chen; Shifeng Chen

arXiv:2311.12631·cs.CV·April 24, 2024·1 cites

Open Access

TL;DR

GPT4Motion integrates large language models, Blender physics, and diffusion-based image generation to produce coherent, high-quality videos from text prompts without additional training, addressing computational costs and motion consistency issues.

Contribution

It introduces a training-free framework that plans physical motions via GPT-4 and Blender, improving text-to-video synthesis quality and coherence.

Findings

01. Effective in scenarios such as object drop, collision, cloth draping, swinging, and liquid flow.

02. Produces videos with high motion coherence and entity consistency.

03. Operates efficiently without additional training.

Abstract

Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion…
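To make "coherent physical motions across frames" concrete, the following is a minimal pure-Python sketch of an object-drop scenario. It is a stand-in, not the paper's method: it does not use Blender, its physics engine, or GPT-4. The function name `simulate_drop`, the `restitution` parameter, and the 24 fps default are all assumptions introduced here for illustration. The sketch only shows the kind of per-frame state a physics simulation yields, which is what keeps motion consistent from one frame to the next.

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2, acting downward


@dataclass
class BallState:
    height: float    # metres above the floor
    velocity: float  # metres per second, positive is upward


def simulate_drop(start_height: float, num_frames: int, fps: int = 24,
                  restitution: float = 0.6) -> list[float]:
    """Return the ball height at each frame for a drop onto a floor at height 0.

    Hypothetical helper: a toy integrator standing in for a real physics engine.
    """
    dt = 1.0 / fps
    state = BallState(height=start_height, velocity=0.0)
    heights = []
    for _ in range(num_frames):
        heights.append(state.height)
        # Semi-implicit Euler step: update velocity first, then position.
        state.velocity -= GRAVITY * dt
        state.height += state.velocity * dt
        if state.height < 0.0:
            # Floor contact: clamp to the floor and reverse the velocity,
            # scaled by the restitution factor to model energy loss.
            state.height = 0.0
            state.velocity = -state.velocity * restitution
    return heights


if __name__ == "__main__":
    # Two seconds of motion at 24 fps for a ball released from 2 m.
    for frame, h in enumerate(simulate_drop(2.0, 48)):
        print(f"frame {frame:02d}: height = {h:.3f} m")
```

Because each frame's height is derived from the previous frame's state, the trajectory cannot jump or drift arbitrarily, which is the property the abstract attributes to simulated scene components.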

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Discriminative Fine-Tuning · Attention Dropout · Weight Decay · Cosine Annealing · Residual Connection · Position-Wise Feed-Forward Layer