Understanding Masked Autoencoders From a Local Contrastive Perspective
Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang

TL;DR
This paper offers a local contrastive perspective on Masked Autoencoders (MAE), revealing how they learn invariance and distribution consistency, and analyzes the roles of decoder and masking in MAE's success.
Contribution
It introduces LC-MAE, a new framework that analyzes MAE's reconstructive and contrastive aspects, providing insights into its mechanisms and design principles.
Findings
MAE learns invariance to random masking.
MAE maintains distribution consistency between tokens and images.
Decoder and masking play crucial roles in MAE's effectiveness.
Abstract
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the underlying mechanisms that drive MAE's efficacy are less well-explored compared to the canonical contrastive learning paradigm. In this paper, we first propose a local perspective to explicitly extract a local contrastive form from MAE's reconstructive objective at the patch level. And then we introduce a new empirical framework, called Local Contrastive MAE (LC-MAE), to analyze both reconstructive and contrastive aspects of MAE. LC-MAE reveals that MAE learns invariance to random masking and ensures distribution consistency between the learned token embeddings and the original images. Furthermore, we dissect the contribution of the decoder…
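The "masking and reconstruction" recipe that the abstract refers to can be illustrated with a small sketch. This is not the authors' LC-MAE code; it only shows the generic MAE idea of hiding a random subset of patches and scoring the reconstruction on the hidden patches alone. The function names `random_masking` and `masked_patch_mse`, the 196-patch grid, and the 0.75 mask ratio are illustrative assumptions, not values taken from this page.

```python
import random


def random_masking(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Hypothetical helper for illustration; not from the paper's code.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_patch_mse(target, prediction, masked):
    """Mean squared error averaged over the masked patches only.

    `target` and `prediction` are lists of patches; each patch is a
    list of floats of equal length.
    """
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        t, p = target[i], prediction[i]
        total += sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)
    return total / len(masked)


if __name__ == "__main__":
    # Assumed setup: a 14x14 grid of patches with three quarters hidden.
    visible, masked = random_masking(196, 0.75, seed=0)
    print(len(visible), len(masked))  # 49 147
```

Only the visible patches would be fed to the encoder, and the loss ignores them entirely, which is why the objective can be read patch by patch as the abstract's "local" perspective suggests.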
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The authors present the finding that MAE performs a form of local contrastive learning, though some previous papers express the same view. Moreover, the paper proposes a new method that combines contrastive learning and MAE. Finally, the writing is good.
1. The paper should cite and compare with related work such as MST, iBOT, and CMAE, which are contrastive + masked methods.
2. The main claims of this paper were already proposed by previous work, so the authors should not ignore them.
3. The current experimental results cannot demonstrate that the method is effective. The authors should report detection and segmentation experiments, as in MAE, to compare fairly with MAE and other related work.
This work explores the role of MAE's decoder in helping the encoder learn “rich hidden representations” in a generative manner, showing that the decoder enables the learning of local features.
### Comparison to prior works
This work proposes a combination of Masked Image Modeling (MIM) and contrastive learning (CL) using a Siamese architecture; however, this strategy has already been explored in several methods (iBOT [1], CAE [2], and CMAE [3]). Moreover, this work only comments on these methods and does not compare the proposed method against them. The work does not appear to offer clear advantages over the prior works.
### Weak experiments
1) very short training
1. The paper is well written and easy to follow, and the main messages based on the analysis look solid and insightful.
2. The representations learned by Uni-SSL (without masking techniques) are more similar to those of masked models than to those of contrastive models, which is interesting and supports the analysis proposed in this paper.
1. As shown in Figure 3, the attention distance increases with a larger mask ratio. However, neither the fine-tuning nor the linear-probing accuracy of MAE is monotonic in the mask ratio. Is it possible to discuss the trade-off in the choice of mask ratio based on the analysis in this paper?
2. In Figure 5(b), the fine-tuning accuracy of DINO is higher than that of MAE and Uni-SSL, while in Table 2 it is the opposite. What differences have I missed?
3. The paper focuses on analyzing the behavior of the decoders in MAE. Is it possib
The motivation is interesting and the writing is good. The authors provide an interesting finding that MIM and CL methods are closely related, which motivates them to propose a new method combining CL and MIM. Moreover, the experiments show fair performance for the proposed method.
1. There are currently too few experiments to demonstrate the effectiveness of the approach. Missing experiments include longer-epoch training, benchmark object detection, and benchmark semantic segmentation.
2. The paper should outperform other contrastive + masked methods, e.g., MST [1] and iBOT [2], which were proposed two years ago.
[1] Mst: Masked self-supervised transformer for visual representation. NeurIPS 2021.
[2] iBOT: Image BERT Pre-Training with Online
Taxonomy
Topics: Image Processing Techniques and Applications · Image and Signal Denoising Methods · Advanced Image Processing Techniques
Methods: Masked autoencoder · Contrastive Learning
