How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders
Qi Zhang, Yifei Wang, Yisen Wang

TL;DR
This paper provides a theoretical framework for understanding Masked Autoencoders (MAE), linking them to contrastive learning, analyzing the impact of mask ratio, and proposing a new loss to improve their performance and address dimensional collapse.
Contribution
It establishes a theoretical connection between MAE and contrastive learning, derives the first downstream guarantees for MAE methods, and proposes a Uniformity-enhanced MAE (U-MAE) loss to address dimensional collapse and improve performance.
Findings
MAE implicitly aligns mask-induced positive pairs
U-MAE effectively addresses dimensional collapse
U-MAE brings significant improvements on CIFAR-10, ImageNet-100, and ImageNet-1K
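Dimensional collapse means that learned features occupy only a few directions of the embedding space. One common way to check for it (an illustrative diagnostic, not necessarily the metric used in the paper) is an entropy-based effective rank of the feature matrix. The function name `effective_rank` and the synthetic data below are assumptions made for this sketch.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of an (n_samples, dim) feature matrix.

    Illustrative diagnostic only: a value near `dim` suggests features
    spread over many directions, a small value suggests collapse.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Singular values measure how much variance each direction carries.
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / (s.sum() + eps)               # normalize to a distribution
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

# Synthetic stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
full = rng.normal(size=(512, 32))                                 # spread out
collapsed = rng.normal(size=(512, 2)) @ rng.normal(size=(2, 32))  # rank 2

print(effective_rank(full), effective_rank(collapsed))
```

On these synthetic inputs the collapsed matrix scores far lower than the full-rank one, which is the pattern one would look for when comparing encoders.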
Abstract
Masked Autoencoders (MAE) based on a reconstruction task have risen to be a promising paradigm for self-supervised learning (SSL) and achieve state-of-the-art performance across different benchmark datasets. However, despite its impressive empirical success, there is still limited theoretical understanding of it. In this paper, we propose a theoretical understanding of how masking matters for MAE to learn meaningful features. We establish a close connection between MAE and contrastive learning, which shows that MAE implicitly aligns the mask-induced positive pairs. Built upon this connection, we develop the first downstream guarantees for MAE methods, and analyze the effect of mask ratio. Besides, as a result of the implicit alignment, we also point out the dimensional collapse issue of MAE, and propose a Uniformity-enhanced MAE (U-MAE) loss that can effectively address this issue and…
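The ideas in the abstract can be sketched in a few lines: a random mask splits an image's patches into a visible and a masked set (the two halves of a mask-induced pair), and a uniformity term is added to the reconstruction loss to discourage collapsed features. This is a minimal sketch under stated assumptions, not the authors' implementation: the squared-cosine-similarity form of the penalty, the weight `lam`, and all function names are hypothetical choices, and the paper's exact definition may differ.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into visible and masked sets for one image."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx

def reconstruction_loss(pred, target):
    """Mean squared error on the masked patches (MAE-style objective)."""
    return float(np.mean((pred - target) ** 2))

def uniformity_penalty(z):
    """Assumed regularizer: mean squared cosine similarity between the
    features of different samples. Rows of z must be nonzero."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T
    n = z.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]   # drop self-similarities
    return float(np.mean(off_diag ** 2))

def u_mae_style_loss(pred, target, z, lam=0.01):
    """Reconstruction loss plus a weighted uniformity term.
    `lam` is a hypothetical default, not a value taken from the source."""
    return reconstruction_loss(pred, target) + lam * uniformity_penalty(z)

rng = np.random.default_rng(0)
visible, masked = random_mask(num_patches=196, mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))
```

With 196 patches and a mask ratio of 0.75, the split leaves 49 visible patches and 147 masked ones. The penalty is 0 for mutually orthogonal features and 1 when every sample maps to the same direction, so minimizing it pushes features of different samples apart.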
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · AI in cancer detection
Methods: Masked autoencoder
