MILAN: Masked Image Pretraining on Language Assisted Representation
Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, Sun-Yuan Kung

TL;DR
MILAN is a masked image pretraining method whose reconstruction targets are semantic image features learned with caption supervision rather than raw pixels, which improves transfer learning performance on downstream vision tasks.
Contribution
The paper proposes MILAN, a masked image pretraining approach that reconstructs caption-supervised semantic features and pairs this objective with a specialized decoder and a mask sampling strategy. It outperforms previous masked pretraining methods.
Findings
Achieves 85.4% top-1 accuracy on ImageNet-1K with ViT-Base.
Outperforms previous masked pretraining methods in semantic segmentation by 4 mIoU points.
Demonstrates superior transfer learning capabilities across vision tasks.
Abstract
Self-attention-based transformer models have dominated many computer vision tasks in the past few years. Their strong performance depends heavily on very large labeled image datasets. To reduce this reliance, reconstruction-based masked autoencoders, which learn high-quality transferable representations from unlabeled images, are gaining popularity. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from the text captions that accompany images. In this work, we propose masked image pretraining on language-assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features that carry substantial semantic signals obtained through caption supervision. Moreover, to accommodate our reconstruction…
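The objective described in the abstract (reconstructing semantic features instead of pixels) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the target features are random placeholders standing in for the output of a caption-supervised image encoder, the 75% mask ratio and 196-patch grid are assumed values, uniform random masking replaces the paper's mask sampling, and no encoder or decoder is modeled.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask, True where a patch is hidden from the encoder.
    Assumption: uniform random sampling, not the paper's mask sampling."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def feature_reconstruction_loss(pred, target, mask):
    """Squared error between L2-normalized predicted and target features,
    averaged over the masked patches only (an assumed design choice)."""
    diff = l2_normalize(pred) - l2_normalize(target)
    per_patch = (diff ** 2).sum(axis=-1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
num_patches, dim = 196, 512  # assumed: 14x14 patch grid, 512-d features

# Placeholder for features from a frozen, caption-supervised image encoder.
target = rng.normal(size=(num_patches, dim))
# Placeholder for decoder output: the target plus noise.
pred = target + 0.5 * rng.normal(size=(num_patches, dim))

mask = random_mask(num_patches, mask_ratio=0.75, rng=rng)
loss = feature_reconstruction_loss(pred, target, mask)
print(f"masked patches: {mask.sum()}, loss: {loss:.4f}")
```

Because both sides are normalized before comparison, the loss in this sketch depends only on the direction of each feature vector, not its magnitude, so rescaling the predictions leaves it unchanged.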
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
