MaskViT: Masked Visual Pre-Training for Video Prediction
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei

TL;DR
MaskViT introduces a masked visual transformer pre-training approach for video prediction, utilizing spatial and spatiotemporal window attention and variable masking during training, leading to improved accuracy and efficiency in high-resolution video generation and robotic planning.
Contribution
This work presents MaskViT, a masked visual pre-training method for video prediction that combines spatial and spatiotemporal window attention with variable-ratio token masking during training, improving prediction quality and inference speed.
Findings
Outperforms prior video prediction models on multiple datasets.
Generates high-resolution videos up to 256x256.
Achieves up to 512x inference speedup in robotic planning.
Abstract
The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution…
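The two masking ideas in the abstract (a variable mask ratio during training, and a mask ratio that shrinks over refinement steps at inference) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the cosine schedule, the sampling range `low`/`high`, and the function names are hypothetical and are not taken from the paper's code.

```python
import math
import random


def mask_ratio(step, total_steps):
    """Fraction of tokens still masked after `step` of `total_steps`.

    Assumption: a cosine schedule. The abstract only says the ratio is
    decreased following a mask scheduling function.
    """
    return math.cos(math.pi / 2 * step / total_steps)


def tokens_masked_per_step(num_tokens, total_steps):
    """Number of tokens left masked after each refinement step."""
    return [
        math.floor(mask_ratio(step, total_steps) * num_tokens)
        for step in range(1, total_steps + 1)
    ]


def sample_training_mask(num_tokens, low=0.5, high=1.0, rng=random):
    """Training-time mask with a variable ratio.

    Assumption: the ratio is drawn uniformly from [low, high]; the
    bounds are placeholders, not values from the paper.
    Returns a list of booleans, True where the token is masked.
    """
    ratio = rng.uniform(low, high)
    num_masked = max(1, round(ratio * num_tokens))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


if __name__ == "__main__":
    # Inference: the count of masked tokens shrinks to zero over the steps.
    print(tokens_masked_per_step(num_tokens=256, total_steps=8))
    # Training: each call masks a different fraction of the tokens.
    print(sum(sample_training_mask(256)))
```

In a full model, each refinement step would predict all masked tokens, keep the most confident predictions, and re-mask the rest so that the count matches the schedule; that model-dependent part is omitted here.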
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
