Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
Xiangwen Kong, Xiangyu Zhang

TL;DR
This paper reveals that Masked Image Modeling (MIM) implicitly learns occlusion-invariant features, providing a new understanding of its success and unifying it with other siamese self-supervised learning methods.
Contribution
It introduces a new perspective that MIM learns occlusion-invariant features and unifies MIM with siamese approaches, clarifying the underlying mechanisms.
Findings
MIM can be interpreted as learning occlusion-invariant features.
The success of MIM depends more on the learned features than on the similarity function.
Occlusion-invariant features serve as a good initialization for vision transformers.
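One rough way to make "occlusion-invariant" concrete is to compare an encoder's features on an image with its features on randomly patch-masked copies of that image. The sketch below is only an illustration under stated assumptions: `mask_patches`, `toy_encoder`, and `occlusion_invariance` are hypothetical names, the encoder is a hand-written stand-in rather than a trained vision transformer, and cosine similarity is just one possible invariance score, not a metric taken from the paper.

```python
import math
import random


def mask_patches(image, patch, ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    Returns a masked copy of the image and the number of masked blocks.
    """
    h, w = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_masked = int(len(blocks) * ratio)
    chosen = rng.sample(blocks, n_masked)
    out = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in chosen:
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                out[i][j] = 0.0
    return out, n_masked


def toy_encoder(image):
    """Hypothetical stand-in for a learned encoder: row means then column means."""
    h, w = len(image), len(image[0])
    rows = [sum(row) / w for row in image]
    cols = [sum(image[i][j] for i in range(h)) / h for j in range(w)]
    return rows + cols


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def occlusion_invariance(image, encoder, patch=4, ratio=0.75, trials=20, seed=0):
    """Mean cosine similarity between clean and occluded features.

    Values near 1 mean the features barely change under occlusion.
    """
    rng = random.Random(seed)
    reference = encoder(image)
    scores = []
    for _ in range(trials):
        occluded, _ = mask_patches(image, patch, ratio, rng)
        scores.append(cosine(reference, encoder(occluded)))
    return sum(scores) / len(scores)


# Usage: score a random 16x16 "image" under 75% patch masking.
pixel_rng = random.Random(1)
image = [[pixel_rng.random() for _ in range(16)] for _ in range(16)]
score = occlusion_invariance(image, toy_encoder)
```

Swapping `toy_encoder` for a real pretrained model's feature extractor would turn this into a crude probe of how stable that model's features are under masking.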
Abstract
Recently, Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition. However, because MIM is a reconstruction-based framework, how it works remains an open question: MIM appears very different from previous well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods, although the latter learn other invariances. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, among which only a) the data transformations, i.e. what invariance to learn, and b) the similarity measurements are different. Furthermore, taking MAE (He et al.) as a representative example of MIM, we empirically find the success of MIM models relates a little to the…
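The unified view described in the abstract, where methods differ only in the data transformation and the similarity measurement, can be pictured as a single objective with two pluggable parts. The following is a schematic sketch, not the paper's derivation: `siamese_loss`, `make_occlusion`, `flip`, `moment_encoder`, and the 1-D list standing in for an image are all hypothetical, and the particular pairings of transformation and similarity are chosen here only for illustration.

```python
import math


def siamese_loss(x, encoder, t1, t2, similarity):
    """Generic siamese objective: encode two views of x and compare them."""
    z1 = encoder(t1(x))
    z2 = encoder(t2(x))  # target branch; a real setup would typically stop gradients here
    return similarity(z1, z2)


def identity(x):
    return list(x)


def make_occlusion(mask):
    """Build a transformation that zeroes every position where mask is 0."""
    def occlude(x):
        return [v if keep else 0.0 for v, keep in zip(x, mask)]
    return occlude


def flip(x):
    """A simple spatial augmentation: reverse the signal."""
    return list(reversed(x))


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def neg_cosine(u, v):
    """Negative cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return -dot / norm


def moment_encoder(x):
    """Hypothetical encoder: first and second moments, so order does not matter."""
    return [sum(x), sum(v * v for v in x)]


x = [1.0, 2.0, 3.0, 4.0]

# Occlusion as the transformation, with a distance-style similarity.
mim_like = siamese_loss(x, moment_encoder, make_occlusion([1, 0, 1, 0]), identity, squared_l2)

# A conventional augmentation as the transformation, with cosine similarity.
aug_like = siamese_loss(x, moment_encoder, flip, identity, neg_cosine)
```

Because `moment_encoder` ignores ordering, it is already invariant to `flip` but not to occlusion, so only the first loss is non-trivial here. That mirrors the idea that the transformation determines which invariance the encoder is pushed to learn.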
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Masked autoencoder · Mutual Information Machine/Mask Image Modeling
