A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond
Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In So Kweon

TL;DR
This paper provides a comprehensive survey of masked autoencoders, highlighting their role in self-supervised learning for vision and their potential to bridge the gap with NLP methods like BERT.
Contribution
It is the first survey to review SSL with masked autoencoders in vision, covering historical development, recent progress, and future implications.
Findings
Masked autoencoders have revived interest in generative SSL in vision.
They show promise in bridging vision and NLP SSL techniques.
The survey discusses diverse applications and future directions.
Abstract
"Masked autoencoders are scalable vision learners," the title of MAE \cite{he2022masked}, suggests that self-supervised learning (SSL) in vision might follow a trajectory similar to that in NLP. Specifically, generative pretext tasks based on masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP. By contrast, early attempts at generative methods in vision were overshadowed by their discriminative counterparts (such as contrastive learning); however, the success of masked image modeling has revived the masked autoencoder (often termed a denoising autoencoder in the past). As a milestone in bridging the gap with BERT in NLP, the masked autoencoder has attracted unprecedented attention for SSL in vision and beyond. This work conducts a comprehensive survey of masked autoencoders to shed light on a promising direction of SSL. As the first to review SSL with masked…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Layer Normalization · Adam · WordPiece · Weight Decay · Linear Warmup With Linear Decay · Residual Connection
