Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

TL;DR
This paper demonstrates that masked autoencoders with a simple asymmetric architecture and high masking ratios are effective, scalable self-supervised learning methods for vision, achieving state-of-the-art results on ImageNet-1K.
Contribution
It introduces a scalable masked autoencoder framework with an asymmetric encoder-decoder design and high masking ratios, enabling efficient training of large vision models.
Findings
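Accelerates pre-training by 3x or more while also improving accuracy.
Achieves 87.8% top-1 accuracy with a vanilla ViT-Huge, the best result among methods that use only ImageNet-1K data.
Outperforms supervised pre-training on downstream transfer tasks.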
Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a…
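The masking step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `random_masking`, the argsort-of-noise sampling trick, and the shapes used (196 patches, as for a 224×224 image cut into 16×16 patches) are assumptions chosen for illustration. It shows why the encoder is cheap: at a 75% masking ratio it only receives a quarter of the patches.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches for each image (hypothetical helper).

    patches: array of shape (batch, num_patches, dim).
    Returns:
      visible     -- (batch, num_keep, dim) patches the encoder would see
      mask        -- (batch, num_patches) bool, True where a patch is masked
      ids_restore -- (batch, num_patches) indices that undo the shuffle, so a
                     decoder could put mask tokens back in their original slots
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, num_patches, _ = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Sorting per-image random noise gives an independent random permutation.
    noise = rng.random((batch, num_patches))
    ids_shuffle = np.argsort(noise, axis=1)
    ids_restore = np.argsort(ids_shuffle, axis=1)

    # The first num_keep entries of each permutation are the visible patches.
    ids_keep = ids_shuffle[:, :num_keep]
    visible = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)

    # Start with everything masked, then unmask the kept positions.
    mask = np.ones((batch, num_patches), dtype=bool)
    np.put_along_axis(mask, ids_keep, False, axis=1)
    return visible, mask, ids_restore

# Assumed example sizes: 2 images, 196 patches each, 8-dim patch vectors.
rng = np.random.default_rng(0)
patches = rng.random((2, 196, 8))
visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75, rng=rng)
print(visible.shape)     # (2, 49, 8): the encoder input is 25% of the patches
print(mask.sum(axis=1))  # [147 147]: masked patches per image
```

In this sketch a reconstruction loss would be evaluated where `mask` is True, and `ids_restore` is what lets a decoder interleave mask tokens with the encoded visible patches in the original order.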
Code & Models
- 🤗 facebook/dinov2-with-registers-large · model · 113k downloads · 12 likes
- 🤗 facebook/vit-mae-base · model · 506k downloads · 39 likes
- 🤗 facebook/vit-mae-huge · model · 6.2k downloads · 6 likes
- 🤗 facebook/vit-mae-large · model · 11k downloads · 10 likes
- 🤗 Team-PIXEL/pixel-base · model · 74 downloads · 37 likes
- 🤗 MCG-NJU/videomae-base-short · model · 693 downloads · 4 likes
- 🤗 MCG-NJU/videomae-base-finetuned-kinetics · model · 25k downloads · 47 likes
- 🤗 MCG-NJU/videomae-base-short-finetuned-kinetics · model · 1.2k downloads · 3 likes
- 🤗 MCG-NJU/videomae-large · model · 3.2k downloads · 37 likes
- 🤗 MCG-NJU/videomae-large-finetuned-kinetics · model · 6.9k downloads · 13 likes
Videos
Masked Autoencoders Are Scalable Vision Learners – Paper explained and animated! · YouTube
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Masked autoencoder · Contrastive Cross-View Mutual Information Maximization
