Masked Autoencoders Are Scalable Vision Learners

Kaiming He; Xinlei Chen; Saining Xie; Yanghao Li; Piotr Dollár; Ross Girshick

arXiv:2111.06377 · cs.CV · December 21, 2021 · 193 cites

TL;DR

This paper demonstrates that masked autoencoders with a simple asymmetric architecture and high masking ratios are effective, scalable self-supervised learning methods for vision, achieving state-of-the-art results on ImageNet-1K.

Contribution

It introduces a scalable masked autoencoder framework with an asymmetric encoder-decoder design and high masking ratios, enabling efficient training of large vision models.
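As a rough illustration of why encoding only the visible patches helps efficiency, the sketch below estimates how much encoder work remains at a given masking ratio. It assumes standard Transformer scaling (per-token layers scale linearly with token count, self-attention scales quadratically), and the 224-pixel image and 16-pixel patch size are assumed values, not taken from this page. The helper names are hypothetical. This is a back-of-the-envelope estimate that ignores the decoder and other overheads, so it is not the paper's measured speedup.

```python
def visible_tokens(image_size, patch_size, mask_ratio):
    """Number of patches the encoder sees after masking (hypothetical helper)."""
    num_patches = (image_size // patch_size) ** 2
    return int(num_patches * (1.0 - mask_ratio))


def encoder_cost_fraction(mask_ratio):
    """Fraction of encoder work left, relative to encoding every patch.

    Assumption: per-token (linear) layers scale with the token count,
    and self-attention scales with the square of the token count.
    """
    keep = 1.0 - mask_ratio
    return {"tokens": keep, "linear_layers": keep, "attention": keep ** 2}


# Assumed configuration: 224x224 image, 16x16 patches, 75% masking.
total = (224 // 16) ** 2                 # 196 patches in total
seen = visible_tokens(224, 16, 0.75)     # 49 patches reach the encoder
fractions = encoder_cost_fraction(0.75)  # attention term drops to 1/16
```

Under these assumptions the encoder processes a quarter of the tokens, which is where most of the savings would come from, since the decoder is described as lightweight.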

Findings

01. Accelerates training by 3x or more.

02. Achieves 87.8% accuracy with ViT-Huge on ImageNet-1K.

03. Outperforms supervised pre-training in transfer tasks.

Abstract

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a…
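The masking step described in the abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it uses plain Python lists in place of tensors, a toy 4x4 patch grid, and strings standing in for latent vectors. The function names `random_masking` and `decoder_input` and the `"[MASK]"` placeholder are hypothetical.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)                    # uniform random permutation
    visible = sorted(perm[:num_keep])    # indices passed to the encoder
    masked = sorted(perm[num_keep:])     # indices the decoder must reconstruct
    return visible, masked


def decoder_input(encoded_visible, visible, num_patches, mask_token="[MASK]"):
    """Restore the full sequence: latents at visible slots, mask token elsewhere."""
    full = [mask_token] * num_patches
    for idx, latent in zip(visible, encoded_visible):
        full[idx] = latent
    return full


# Toy example: a 4x4 grid of patches with 75% of them masked.
num_patches = 16
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)

# Stand-in for the encoder: it only ever receives the visible patches.
latents = [f"z{idx}" for idx in visible]

# The decoder receives the full-length sequence, including mask tokens.
sequence = decoder_input(latents, visible, num_patches)
```

The point of the split is that mask tokens are introduced only at the decoder stage, so the encoder's sequence length shrinks with the masking ratio.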

Code & Models

Videos

Masked Autoencoders Are Scalable Vision Learners – Paper explained and animated! · YouTube

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Masked autoencoder · Contrastive Cross-View Mutual Information Maximization