VIMI: Grounding Video Generation through Multi-modal Instruction
Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov

TL;DR
VIMI introduces a multimodal grounded video generation model trained on a large-scale dataset, enabling diverse, coherent, and semantically controlled video synthesis from multi-modal prompts, achieving state-of-the-art results.
Contribution
The paper presents a novel two-stage training framework for grounded video generation using multimodal prompts, addressing the lack of large-scale datasets and improving multimodal understanding.
Findings
VIMI achieves state-of-the-art results on the UCF101 benchmark.
The model produces temporally coherent and semantically controlled videos.
VIMI demonstrates strong multimodal understanding and diverse video generation capabilities.
Abstract
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. Secondly, we finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions. This…
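
The retrieval step described in the abstract, pairing each text prompt with in-context examples, can be illustrated with a minimal sketch. The abstract does not say which retriever, embedding model, or similarity measure is used, so everything here is an assumption made for illustration: the toy embedding vectors, cosine similarity as the score, and the hypothetical helper names `cosine`, `retrieve_in_context`, and `build_multimodal_prompt`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_in_context(prompt_vec, pool, k=2):
    """Return the ids of the k pool items most similar to the prompt embedding.

    pool is a list of (example_id, embedding) pairs. This is a hypothetical
    stand-in for the retrieval method; the paper's actual retriever is not
    specified in the abstract.
    """
    ranked = sorted(pool, key=lambda item: cosine(prompt_vec, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


def build_multimodal_prompt(text, prompt_vec, pool, k=2):
    """Attach the retrieved in-context examples to a text prompt."""
    return {"text": text, "in_context": retrieve_in_context(prompt_vec, pool, k)}


# Toy embeddings, assumed purely for illustration.
pool = [
    ("cat.jpg", [1.0, 0.0]),
    ("dog.jpg", [0.9, 0.1]),
    ("car.jpg", [0.0, 1.0]),
]
sample = build_multimodal_prompt("a cat playing with yarn", [1.0, 0.0], pool, k=2)
```

In a real pipeline the vectors would come from a learned text or image encoder and the pool would be indexed for approximate nearest-neighbor search; the sketch only shows the shape of the pairing, not the authors' implementation.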
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
Methods: Diffusion
