Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Tsu-Jui Fu; Licheng Yu; Ning Zhang; Cheng-Yang Fu; Jong-Chyi Su; William Yang Wang; Sean Bell

arXiv:2211.12824 · cs.CV · March 23, 2023

TL;DR

This paper introduces a unified model, MMVG, for text-guided video completion tasks including prediction, rewind, and infilling, by leveraging masked video generation and visual token discretization.

Contribution

The paper proposes a novel multimodal masked video generation approach that unifies various video completion tasks under a single framework guided by natural language.

Findings

01. Effective in diverse scenarios, including egocentric, animation, and gaming videos.

02. Generates high-quality visual appearances aligned with text instructions.

03. Single model handles prediction, rewind, and infilling tasks seamlessly.

Abstract

Generating a video given the first several static frames is challenging because the model must anticipate reasonable future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or to infill between the head and tail is also crucial, but these settings have rarely been explored for video completion. Since the hints of just a few frames could lead to different outcomes, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any…
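The three completion settings named in the abstract differ only in which frames are given and which are masked. The sketch below illustrates that idea under stated assumptions: the function names `visible_frames` and `mask_tokens`, the `num_given` parameter, and the `mask_id` placeholder are hypothetical and do not come from the paper or its repository, and the actual MMVG masking strategy may differ.

```python
from typing import List


def visible_frames(task: str, num_frames: int, num_given: int = 1) -> List[bool]:
    """Return one flag per frame: True = frame is given, False = frame is masked.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    if not 0 < num_given < num_frames:
        raise ValueError("num_given must be between 1 and num_frames - 1")
    if task == "prediction":
        # Keep the first frames, generate the future.
        return [i < num_given for i in range(num_frames)]
    if task == "rewind":
        # Keep the last frames, generate the past.
        return [i >= num_frames - num_given for i in range(num_frames)]
    if task == "infilling":
        # Keep head and tail, generate everything in between.
        if 2 * num_given >= num_frames:
            raise ValueError("no frames left to infill")
        return [i < num_given or i >= num_frames - num_given
                for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


def mask_tokens(tokens: List[List[int]], visible: List[bool],
                mask_id: int = -1) -> List[List[int]]:
    """Replace every visual token of a masked frame with mask_id (assumed value)."""
    return [frame if keep else [mask_id] * len(frame)
            for frame, keep in zip(tokens, visible)]
```

For example, `visible_frames("infilling", 4)` keeps only the first and last of four frames, and `mask_tokens` then replaces the token ids of the two middle frames with the placeholder so that a model would have to generate them.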

Code & Models

Repositories

tsujuifu/pytorch_tvc (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging