MILAN: Masked Image Pretraining on Language Assisted Representation

Zejiang Hou; Fei Sun; Yen-Kuang Chen; Yuan Xie; Sun-Yuan Kung

arXiv:2208.06049 · cs.CV · December 21, 2022 · 27 cites

TL;DR

MILAN introduces a novel masked image pretraining method that leverages language-assisted semantic features from captions, leading to improved transfer learning performance in vision tasks.

Contribution

The paper proposes MILAN, a new masked image pretraining approach using caption-based semantic features, with a specialized decoder and mask sampling, outperforming previous methods.

Findings

01. Achieves 85.4% top-1 accuracy on ImageNet-1K with ViT-Base.

02. Outperforms previous masked pretraining methods in semantic segmentation by 4 mIoU points.

03. Demonstrates superior transfer learning capabilities across vision tasks.

Abstract

Self-attention-based transformer models have dominated many computer vision tasks in the past few years. Their superb model quality depends heavily on excessively large labeled image datasets. To reduce the reliance on large labeled datasets, reconstruction-based masked autoencoders, which learn high-quality transferable representations from unlabeled images, are gaining popularity. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from the text captions accompanying the images. In this work, we propose masked image pretraining on language-assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features with substantial semantic signals, obtained using caption supervision. Moreover, to accommodate our reconstruction…
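The objective described in the abstract, reconstructing semantic image features rather than raw pixels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 75% mask ratio, uniform random masking, the unit-normalized squared-distance loss, and the choice to average over masked patches only are all assumptions made for illustration. The target features here are random stand-ins for the output of a caption-supervised image encoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def masked_feature_loss(predicted, target, masked_indices):
    """Squared distance between unit-normalized predicted and target features,
    averaged over the masked patch positions only (an assumption of this sketch)."""
    total = 0.0
    for i in masked_indices:
        p = l2_normalize(predicted[i])
        t = l2_normalize(target[i])
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(masked_indices)


def random_mask(num_patches, mask_ratio, rng):
    """Uniformly choose which patch indices to hide (hypothetical helper)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


rng = random.Random(0)
num_patches, dim = 16, 8

# Stand-in for per-patch features from a caption-supervised image encoder.
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

# Hide 75% of the patches (the ratio is an assumption, not taken from the text).
masked = random_mask(num_patches, 0.75, rng)

# A noisy copy of the targets plays the role of an imperfect decoder output.
noisy = [[x + rng.gauss(0, 0.5) for x in row] for row in target]

perfect_loss = masked_feature_loss(target, target, masked)
noisy_loss = masked_feature_loss(noisy, target, masked)
print(perfect_loss, noisy_loss)
```

Because both sides are normalized before comparison, the loss in this sketch depends only on the direction of each feature vector, not its magnitude; a perfect reconstruction gives zero and any directional error gives a positive value.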

Code & Models

Repositories

zejiangh/milan
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Attentive Walk-Aggregating Graph Neural Network