Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Ligong Han; Jian Ren; Hsin-Ying Lee; Francesco Barbieri; Kyle Olszewski; Shervin Minaee; Dimitris Metaxas; Sergey Tulyakov

arXiv:2203.02573·cs.CV·March 8, 2022

Open Access

TL;DR

This paper introduces a multimodal video synthesis framework that combines text and images to generate diverse, high-quality videos, overcoming limitations of single-modality methods by enabling detailed control over content and motion.

Contribution

It presents a novel multimodal approach using a bidirectional transformer and new training techniques to improve video quality, diversity, and length, with state-of-the-art results on multiple datasets.

Findings

01. Achieved state-of-the-art results on four datasets.
02. Generated longer, more diverse videos than previous methods.
03. Effectively incorporated various visual modalities for flexible video synthesis.

Abstract

Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video…
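The mask-prediction sampling mentioned in the abstract can be illustrated with a minimal sketch. This is generic mask-predict decoding over discrete tokens, not the improved algorithm the paper proposes. The names `MASK`, `toy_predictor`, and `mask_predict`, the linear re-masking schedule, and the random stand-in for the bidirectional transformer are all assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id for a special [MASK] token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer (an assumption, not the paper's model).

    Returns one (token, confidence) pair per position. A real model would
    condition on the unmasked tokens plus the text and image inputs.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(length, vocab_size, steps, predictor, seed=0):
    """Generic mask-predict decoding.

    Start from a fully masked sequence, fill every masked position in
    parallel, then re-mask the least confident tokens. The number of
    re-masked tokens shrinks linearly until none remain.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    confidence = [0.0] * length
    for step in range(steps):
        predictions = predictor(tokens, vocab_size, rng)
        # Only positions that are currently masked receive new predictions.
        for i, (tok, conf) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i] = tok
                confidence[i] = conf
        # Linear schedule (an assumption): fewer tokens are re-masked each step.
        n_mask = int(length * (steps - step - 1) / steps)
        if n_mask == 0:
            break
        lowest = sorted(range(length), key=lambda i: confidence[i])[:n_mask]
        for i in lowest:
            tokens[i] = MASK
            confidence[i] = 0.0
    return tokens


video_tokens = mask_predict(length=16, vocab_size=100, steps=4, predictor=toy_predictor)
```

Because every position is predicted in parallel at each step, the number of model calls depends on `steps` rather than on the sequence length, which is the usual motivation for this style of decoding over token-by-token autoregressive sampling.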

Code & Models

Repositories

snap-research/mmvid (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Self-Learning