Improve Supervised Representation Learning with Masked Image Modeling
Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan, Mojtaba Seyedhosseini

TL;DR
This paper introduces a simple method to enhance supervised visual representation learning by integrating masked image modeling into existing training frameworks, leading to improved downstream task performance.
Contribution
It presents a straightforward approach to combine masked image modeling with supervised training of vision transformers, boosting representation quality without additional inference costs.
Findings
Achieved 81.72% accuracy on ImageNet-1k with ViT-B/14, surpassing the baseline by 2.01%.
Improved image retrieval performance by 1.32% on ImageNet-1k.
Method scales effectively to larger models and datasets.
Abstract
Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by recent success of adopting masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that can easily integrate MIM into existing supervised training paradigms. In our design, in addition to the original classification task applied to a vision transformer image encoder, we add a shallow transformer-based decoder on top of the encoder and introduce an MIM task which tries to reconstruct image tokens based on masked image inputs. We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations for downstream tasks such as classification, image retrieval, and semantic segmentation. We conduct a comprehensive…
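The training objective described in the abstract can be sketched as a classification loss plus a masked-token reconstruction loss. This is only an illustrative sketch, not the authors' implementation: it assumes the image tokens are discrete ids predicted with softmax cross-entropy, that the MIM loss is averaged over masked positions only, and that the two terms are combined with a hypothetical weight `mim_weight`. The function and parameter names are invented for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(cls_logits, label, token_logits, token_targets, mask,
                  mim_weight=1.0):
    """Classification loss plus MIM loss (assumed form, see lead-in).

    cls_logits:    class scores from the encoder's classification head
    label:         ground-truth class index
    token_logits:  per-position token scores from the shallow decoder
    token_targets: per-position target token ids (assumed discrete)
    mask:          per-position booleans, True where the input was masked
    mim_weight:    hypothetical weight on the MIM term
    """
    cls_loss = cross_entropy(cls_logits, label)

    # Assumption: only masked positions contribute to the reconstruction loss.
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if masked:
        mim_loss = sum(
            cross_entropy(token_logits[i], token_targets[i]) for i in masked
        ) / len(masked)
    else:
        mim_loss = 0.0

    return cls_loss + mim_weight * mim_loss
```

Because the decoder and the MIM term appear only in the training loss, dropping them at inference leaves the encoder and classification head unchanged, which is consistent with the abstract's claim of no inference overhead.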
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Linear Layer · Softmax · Attention Is All You Need · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer · Masked Image Modeling
