WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Xiaofeng Wang; Zheng Zhu; Guan Huang; Boyuan Wang; Xinze Chen; Jiwen Lu

arXiv:2401.09985 · cs.CV · January 19, 2024 · 5 cites

Open Access

TL;DR

WorldDreamer introduces a versatile, large-scale world model for video generation that predicts masked visual tokens, enabling natural scene and driving environment synthesis, and supports multi-modal prompts for interactive tasks.

Contribution

It presents a novel unsupervised visual sequence modeling approach inspired by language models, extending world modeling to general environments beyond specific scenarios.

Findings

01. Excels at generating diverse videos, including natural and driving scenes

02. Supports tasks such as text-to-video, image-to-video, and video editing

03. Demonstrates versatility and effectiveness in capturing dynamic world elements

Abstract

World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general world dynamic environments. Therefore, we introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that…
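
The mechanism the abstract describes, mapping visual inputs to discrete tokens and predicting the masked ones, can be illustrated with a minimal sketch. This is not the authors' implementation: a nearest-neighbor lookup in a fixed codebook stands in for a learned visual tokenizer, no prediction network is included, and the names `tokenize`, `mask_tokens`, `masked_cross_entropy`, and `MASK_ID` are hypothetical, chosen only for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel id marking a masked position


def tokenize(patches, codebook):
    """Map each patch vector (N, D) to the index of its nearest codebook row (K, D).

    Assumption: nearest-neighbor quantization stands in for a learned tokenizer.
    """
    sq_dist = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID; return (masked, mask)."""
    n = tokens.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = MASK_ID
    return masked, mask


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits: (N, K) scores over the K codebook entries, from any predictor.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(targets.shape[0]), targets]
    return nll[mask].mean()
```

In this sketch, a predictor would receive the `masked` sequence (optionally with prompt features) and emit `logits`; the loss is computed only where `mask` is true, so the unmasked tokens act purely as context.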

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization