VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Zhan Tong; Yibing Song; Jue Wang; Limin Wang

arXiv:2203.12602 · cs.CV · October 19, 2022 · 433 cites

TL;DR

VideoMAE demonstrates that high masking ratios and data quality are key to efficient self-supervised video pre-training, achieving strong results on various datasets without extra data.

Contribution

The paper introduces VideoMAE with high masking ratios and customized tube masking, significantly improving data efficiency in self-supervised video pre-training.

Findings

01. High masking ratios (90% to 95%) remain effective because video content is temporally redundant.

02. VideoMAE achieves strong performance on very small datasets without using extra data.

03. Data quality matters more than data quantity for pre-training.

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets…
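
The tube masking idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "tube" means one random spatial mask shared by every temporal position, so a masked patch location stays masked throughout the clip. The token grid size (8 x 14 x 14), the function name `tube_mask`, and the use of NumPy are illustrative assumptions, not details taken from this page or from the authors' code.

```python
import numpy as np


def tube_mask(num_frames, height, width, mask_ratio=0.9, seed=None):
    """Return a boolean array of shape (num_frames, height, width).

    True marks a masked token. The same spatial pattern is repeated along
    the time axis, which is the assumed meaning of "tube" masking here.
    """
    rng = np.random.default_rng(seed)
    num_spatial = height * width
    num_masked = int(round(mask_ratio * num_spatial))

    # Sample which spatial positions to hide, without replacement.
    flat = np.zeros(num_spatial, dtype=bool)
    flat[rng.choice(num_spatial, size=num_masked, replace=False)] = True
    spatial = flat.reshape(height, width)

    # Repeat the spatial mask for every temporal position.
    return np.broadcast_to(spatial, (num_frames, height, width)).copy()


# Hypothetical token grid: 8 temporal positions, 14 x 14 spatial patches.
mask = tube_mask(num_frames=8, height=14, width=14, mask_ratio=0.9, seed=0)
visible = int((~mask).sum())
print(mask.shape, visible)  # (8, 14, 14) 160
```

Under these assumptions, only 160 of 1568 tokens stay visible at a 90% ratio. Because the hidden locations are the same at every temporal position, a masked patch cannot be recovered by copying the same location from a neighboring frame.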

Code & Models

Videos

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training · slideslive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning