Learning with Unmasked Tokens Drives Stronger Vision Learners
Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

TL;DR
This paper proposes a novel improvement to masked image modeling by incorporating unmasked tokens into training, leading to more discriminative representations and significant performance gains on various vision tasks.
Contribution
The authors introduce a method that explicitly uses unmasked tokens during MIM pre-training, enhancing context learning and resulting in stronger vision representations.
Findings
Achieved 84.2% top-1 accuracy on ImageNet-1K with ViT-B.
Improved performance on semantic segmentation and fine-grained classification.
Enhanced model robustness across diverse evaluation metrics.
Abstract
Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIM methods such as the Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens, passing only the unmasked tokens through the encoder, and having the decoder reconstruct the masked tokens of the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens, which may impede the encoder's broader context learning. To address this limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage…
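The baseline that the abstract describes can be sketched as follows. This is a minimal illustration of MAE-style random masking with a reconstruction loss computed on masked tokens only, not the authors' implementation. The function names, the 0.75 mask ratio, and the toy token values are assumptions made for this example. The paper's added supervision for unmasked tokens is not shown, because the abstract does not specify it.

```python
import random


def random_masking(num_tokens, mask_ratio, seed=0):
    """Randomly split token indices into (visible, masked) lists.

    Hypothetical helper for illustration; a seed is used so the
    split is reproducible.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_keep = int(num_tokens * (1 - mask_ratio))
    visible = sorted(order[:num_keep])   # tokens the encoder would see
    masked = sorted(order[num_keep:])    # tokens the decoder must reconstruct
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error averaged over masked tokens only.

    Predictions at visible positions never enter the loss, which is
    the "regressing masked tokens only" behaviour the abstract refers to.
    """
    total = 0.0
    for i in masked:
        sq_errors = [(p - t) ** 2 for p, t in zip(pred[i], target[i])]
        total += sum(sq_errors) / len(sq_errors)
    return total / len(masked)


# Toy example: 16 tokens with 2 features each (assumed values).
target = [[1.0, 2.0] for _ in range(16)]
pred = [[0.0, 0.0] for _ in range(16)]

visible, masked = random_masking(num_tokens=16, mask_ratio=0.75)
loss = masked_mse(pred, target, masked)
print(len(visible), len(masked), loss)
```

In this sketch, changing the predictions at visible positions leaves the loss unchanged, since only masked positions are regressed. That property is the starting point for the limitation the abstract raises.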
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Focus · Mutual Information Machine/Mask Image Modeling
