VIMI: Grounding Video Generation through Multi-modal Instruction

Yuwei Fang; Willi Menapace; Aliaksandr Siarohin; Tsai-Shien Chen; Kuan-Chien Wang; Ivan Skorokhodov; Graham Neubig; Sergey Tulyakov

arXiv:2407.06304·cs.CV·July 10, 2024

TL;DR

VIMI is a video generation model grounded in multimodal prompts. Trained on a large-scale, retrieval-augmented multimodal prompt dataset, it synthesizes diverse, temporally coherent, and semantically controlled videos and achieves state-of-the-art results on the UCF101 benchmark.

Contribution

The paper presents a two-stage training framework for grounded video generation from multimodal prompts. It addresses the lack of large-scale multimodal prompt video datasets and improves the model's multimodal understanding.

Findings

01

VIMI achieves state-of-the-art results on the UCF101 benchmark.

02

The model produces temporally coherent and semantically controlled videos.

03

VIMI demonstrates strong multimodal understanding and diverse video generation capabilities.

Abstract

Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts, and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. In the second stage, we finetune the model from the first stage on three video generation tasks, incorporating multimodal instructions. This…
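The retrieval step described in the abstract, pairing in-context examples with a given text prompt, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which retriever or encoder is used, so the bag-of-words `embed` function stands in for a real encoder, and the names `retrieve_in_context`, `build_multimodal_prompt`, and the `image_id`/`caption` fields are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real encoder (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_in_context(prompt, pool, k=2):
    """Return the k pool entries whose captions are most similar to the prompt."""
    query = embed(prompt)
    ranked = sorted(
        pool,
        key=lambda ex: cosine(query, embed(ex["caption"])),
        reverse=True,
    )
    return ranked[:k]


def build_multimodal_prompt(prompt, pool, k=2):
    """Attach retrieved examples to a text prompt (hypothetical record layout)."""
    examples = retrieve_in_context(prompt, pool, k)
    return {"text": prompt, "in_context": [ex["image_id"] for ex in examples]}


# Hypothetical retrieval pool of captioned images.
pool = [
    {"image_id": "img_cat", "caption": "a cat sitting on a sofa"},
    {"image_id": "img_dog", "caption": "a dog running on the beach"},
    {"image_id": "img_car", "caption": "a red car driving in the city"},
]

augmented = build_multimodal_prompt("a cat running on the beach", pool, k=2)
print(augmented)
```

In a real pipeline the toy encoder would presumably be replaced by a learned text or image-text encoder with an approximate nearest-neighbor index; the sketch only shows the shape of the pairing step, not the authors' implementation.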

Videos

VIMI: Grounding Video Generation through Multi-modal Instruction · underline

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media

Methods: Diffusion