MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images
Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu

TL;DR
MovieFactory is a framework that automatically generates multi-scene, cinematic movies with matching sound from natural language input. It uses a large language model (ChatGPT) to write the script, a pretrained text-to-image diffusion model extended to video for the visuals, and audio retrieval for the sound.
Contribution
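It introduces what the authors describe as, to the best of their knowledge, the first fully automated movie generation system: it creates multi-scene movies with sound from simple text, whereas the existing methods it compares against produce soundless, single-scene videos of modest quality.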
Findings
Produces realistic, diverse, multi-scene movies with synchronized audio
Uses a two-stage process for video generation involving spatial finetuning and temporal learning
Demonstrates high-quality results with immersive visual and auditory experiences
Abstract
In this paper, we present MovieFactory, a powerful framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies on the demand of natural languages. As the first fully automated movie generation model to the best of our knowledge, our approach empowers users to create captivating movies with smooth transitions using simple text inputs, surpassing existing methods that produce soundless videos limited to a single scene of modest quality. To facilitate this distinctive functionality, we leverage ChatGPT to expand user-provided text into detailed sequential scripts for movie generation. Then we bring scripts to life visually and acoustically through vision generation and audio retrieval. To generate videos, we extend the capabilities of a pretrained text-to-image diffusion model through a two-stage process. Firstly, we employ…
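The flow the abstract describes (text expanded into a per-scene script, then each scene given visuals and a retrieved sound) can be sketched roughly as follows. This is only an illustration of the data flow under stated assumptions, not the authors' implementation: every name here (`Scene`, `Shot`, `expand_script`, `retrieve_audio`, `make_movie`, the fake model functions, and the tag-overlap retrieval) is hypothetical, and the real language model and video model are replaced by trivial stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scene:
    visual_prompt: str  # text that would drive video generation for this scene
    audio_query: str    # text used to look up a matching sound


@dataclass
class Shot:
    scene: Scene
    video: str          # placeholder standing in for generated frames
    audio: str          # name of the retrieved audio clip


def expand_script(user_text: str, llm: Callable[[str], List[str]]) -> List[Scene]:
    """Turn a short user prompt into per-scene descriptions via a language model.

    `llm` is a hypothetical callable returning one description per scene.
    """
    return [Scene(visual_prompt=line, audio_query=line) for line in llm(user_text)]


def retrieve_audio(query: str, library: Dict[str, List[str]]) -> str:
    """Toy retrieval (an assumption): pick the clip whose tags overlap the query most."""
    words = set(query.lower().split())
    return max(library, key=lambda name: len(words & set(library[name])))


def make_movie(user_text: str,
               llm: Callable[[str], List[str]],
               video_model: Callable[[str], str],
               library: Dict[str, List[str]]) -> List[Shot]:
    """Script expansion, then per-scene video generation and audio retrieval."""
    shots = []
    for scene in expand_script(user_text, llm):
        video = video_model(scene.visual_prompt)
        audio = retrieve_audio(scene.audio_query, library)
        shots.append(Shot(scene, video, audio))
    return shots


# Stand-ins so the sketch runs without any external model or service.
def fake_llm(text: str) -> List[str]:
    return [f"{text}: waves on a beach", f"{text}: rain in a forest"]


def fake_video_model(prompt: str) -> str:
    return f"<clip for '{prompt}'>"


AUDIO_LIBRARY = {
    "ocean.wav": ["waves", "beach", "sea"],
    "rain.wav": ["rain", "forest", "storm"],
}

movie = make_movie("a quiet day", fake_llm, fake_video_model, AUDIO_LIBRARY)
for shot in movie:
    print(shot.video, "+", shot.audio)
```

Passing the models in as callables keeps the orchestration separate from whichever script, video, and audio components are actually used, which is the only structural point the sketch is meant to convey.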
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Music and Audio Processing
Methods: ALIGN · Diffusion
