VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang

TL;DR
VideoMAE shows that an extremely high masking ratio and high-quality data are key to efficient self-supervised video pre-training, reaching strong results even on small datasets without extra data.
Contribution
The paper introduces VideoMAE with high masking ratios and customized tube masking, significantly improving data efficiency in self-supervised video pre-training.
Findings
Extremely high masking ratios (90% to 95%) still perform well because video content is temporally redundant.
VideoMAE performs strongly on very small datasets without any extra data.
Data quality outweighs data quantity for pre-training.
Abstract
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets…
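The tube masking idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tube_mask` is hypothetical, and the token grid of 8 temporal slices by 14 x 14 spatial patches is an assumed example size, not a value taken from this page. The sketch samples one random spatial mask and repeats it along the time axis, so that a masked patch stays masked in every temporal slice.

```python
import random


def tube_mask(num_slices, grid_h, grid_w, mask_ratio, seed=None):
    """Return a list of num_slices rows, each with grid_h * grid_w booleans.

    True marks a masked token. The same spatial positions are masked in
    every temporal slice, which forms "tubes" through time.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    num_spatial = grid_h * grid_w
    num_masked = int(mask_ratio * num_spatial)
    # Pick the spatial positions to hide, once for the whole clip.
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    spatial_mask = [i in masked_positions for i in range(num_spatial)]
    # Repeat the identical spatial mask for each temporal slice.
    return [list(spatial_mask) for _ in range(num_slices)]


# Assumed example grid: 8 temporal slices, 14 x 14 spatial patches.
mask = tube_mask(num_slices=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
total_tokens = 8 * 14 * 14
masked_tokens = sum(sum(row) for row in mask)
visible_tokens = total_tokens - masked_tokens
print(total_tokens, masked_tokens, visible_tokens)
```

Under these assumed sizes, a 90% ratio leaves only about a tenth of the tokens visible to the encoder. Because the mask is shared across time, a hidden patch cannot be recovered by copying it from a neighbouring frame.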
Code & Models
- 🤗 MCG-NJU/videomae-base-short (model) · 693 dl · ♡ 4
- 🤗 MCG-NJU/videomae-base-finetuned-kinetics (model) · 25k dl · ♡ 47
- 🤗 MCG-NJU/videomae-base-short-finetuned-kinetics (model) · 1.2k dl · ♡ 3
- 🤗 MCG-NJU/videomae-large (model) · 3.2k dl · ♡ 37
- 🤗 MCG-NJU/videomae-large-finetuned-kinetics (model) · 6.9k dl · ♡ 13
- 🤗 MCG-NJU/videomae-base-short-ssv2 (model) · 13 dl · ♡ 2
- 🤗 MCG-NJU/videomae-base-short-finetuned-ssv2 (model) · 8 dl · ♡ 1
- 🤗 MCG-NJU/videomae-base-ssv2 (model) · 593 dl · ♡ 2
- 🤗 MCG-NJU/videomae-base-finetuned-ssv2 (model) · 1.2k dl · ♡ 7
- 🤗 MCG-NJU/videomae-base (model) · 76k dl · ♡ 50
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
