Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Ligong Han, Jian Ren, Hsin-Ying Lee, Francesco Barbieri, Kyle Olszewski, Shervin Minaee, Dimitris Metaxas, Sergey Tulyakov

TL;DR
This paper introduces a multimodal video synthesis framework that combines text and images to generate diverse, high-quality videos, overcoming limitations of single-modality methods by enabling detailed control over content and motion.
Contribution
It presents a novel multimodal approach using a bidirectional transformer and new training techniques to improve video quality, diversity, and length, with state-of-the-art results on multiple datasets.
Findings
Achieved state-of-the-art results on four datasets.
Generated longer, more diverse videos than previous methods.
Effectively incorporated various visual modalities for flexible video synthesis.
Abstract
Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video…
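The abstract mentions sampling video tokens with a mask-prediction algorithm. As a rough illustration only, the sketch below shows the generic mask-predict loop (predict all masked tokens, then re-mask the least confident ones on a linear schedule), not the improved variant the paper proposes, whose details are not given here. The names `MASK`, `toy_predictor`, and `mask_predict` are hypothetical, and the random predictor is merely a stand-in for a bidirectional transformer conditioned on text and images.

```python
import random

MASK = -1  # hypothetical id marking a masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer (assumption, not the paper's model).

    Returns one (token_id, confidence) pair per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(length, vocab_size, steps, seed=0):
    """Generic iterative mask-predict decoding over a discrete token sequence."""
    rng = random.Random(seed)
    tokens = [MASK] * length   # start from a fully masked sequence
    conf = [0.0] * length      # confidence of each currently filled token

    for t in range(steps):
        preds = toy_predictor(tokens, vocab_size, rng)

        # Fill only the masked positions; kept tokens retain their confidence.
        for i, (tok, c) in enumerate(preds):
            if tokens[i] == MASK:
                tokens[i] = tok
                conf[i] = c

        # Linear schedule: fewer tokens are re-masked at each step.
        n_mask = int(length * (steps - 1 - t) / steps)
        if n_mask == 0:
            break

        # Re-mask the n_mask least confident positions for the next pass.
        lowest = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        for i in lowest:
            tokens[i] = MASK

    return tokens


if __name__ == "__main__":
    print(mask_predict(length=16, vocab_size=1024, steps=4))
```

In a real system the predictor would return per-position distributions over a learned video codebook, and the decoded token grid would be passed to the quantized autoencoder's decoder to produce frames.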
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Self-Learning
