Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

David Junhao Zhang; Dongxu Li; Hung Le; Mike Zheng Shou; Caiming Xiong; Doyen Sahoo

arXiv:2401.01827 · cs.CV · January 4, 2024 · 2 cites

TL;DR

Moonshot introduces a multimodal video generation model that allows control over appearance and geometry through image and text inputs, improving quality and versatility in video synthesis tasks.

Contribution

The paper presents Moonshot, a novel video diffusion model with multimodal conditioning and the ability to incorporate pre-trained image control modules without extra training.

Findings

1. Significant improvement in visual quality and temporal consistency.
2. Versatile multimodal conditioning enables diverse applications.
3. Easy integration with existing image control modules.

Abstract

Most existing video diffusion models (VDMs) are limited to text conditions alone. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features, and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without the extra training overhead required by prior methods. Experiments show that with versatile multimodal conditioning mechanisms,…
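The abstract names a decoupled cross-attention layer but does not spell out how it works. The NumPy sketch below illustrates one common reading of the term, and it is an assumption rather than the paper's implementation: video tokens supply a shared query, text and image tokens each get their own key/value projections, and the two attention outputs are summed. The function names, the `params` dictionary, the `image_scale` weight, and all tensor sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(video_tokens, text_tokens, image_tokens,
                              params, image_scale=1.0):
    # Assumption: one query projection shared by both branches.
    q = video_tokens @ params["w_q"]
    # Text branch with its own key/value projections.
    text_out = attention(q,
                         text_tokens @ params["w_k_text"],
                         text_tokens @ params["w_v_text"])
    # Image branch with separate (decoupled) key/value projections.
    image_out = attention(q,
                          image_tokens @ params["w_k_image"],
                          image_tokens @ params["w_v_image"])
    # Assumption: the two branches are combined by a weighted sum.
    return text_out + image_scale * image_out


# Toy sizes, hypothetical: 16 video tokens, 5 text tokens, 4 image tokens.
rng = np.random.default_rng(0)
dim = 8
names = ["w_q", "w_k_text", "w_v_text", "w_k_image", "w_v_image"]
params = {name: rng.normal(size=(dim, dim)) * 0.1 for name in names}

video_tokens = rng.normal(size=(16, dim))
text_tokens = rng.normal(size=(5, dim))
image_tokens = rng.normal(size=(4, dim))

out = decoupled_cross_attention(video_tokens, text_tokens, image_tokens, params)
print(out.shape)
```

In this sketch, setting `image_scale=0.0` reduces the layer to plain text cross-attention, which is one reason a design with separate branches is convenient: the image condition can be weighted or switched off without touching the text path.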

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Diffusion