MVP: Multimodality-guided Visual Pre-training

Longhui Wei; Lingxi Xie; Wengang Zhou; Houqiang Li; Qi Tian

arXiv:2203.05175 · cs.CV · March 11, 2022 · 1 citation

TL;DR

This paper introduces MVP, a multimodality-guided visual pre-training method that leverages vision-language models like CLIP to enhance masked image modeling, resulting in significant improvements in visual recognition tasks.

Contribution

MVP is the first to incorporate multimodal guidance from vision-language models into masked image modeling for visual pre-training.

Findings

01. MVP achieves 52.4% mIoU on ADE20K, surpassing BEIT, the previous state of the art.

02. Pre-training ViT-Base/16 for 300 epochs with MVP improves downstream task performance.

03. Multimodal guidance significantly boosts the effectiveness of visual pre-training.

Abstract

Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representation by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training the ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition…
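The idea of aligning token-level features with a pre-defined space can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine-based loss, the function names (`l2_normalize`, `alignment_loss`), the 40% mask ratio, and the tensor shapes (196 tokens, as for a 224x224 image with 16x16 patches, and 512 feature dimensions) are all assumptions made for illustration, and random arrays stand in for the features that a frozen teacher such as the CLIP vision branch would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(student_tokens, teacher_tokens, mask):
    """Mean (1 - cosine similarity) over masked token positions.

    student_tokens, teacher_tokens: arrays of shape (batch, num_tokens, dim)
    mask: boolean array of shape (batch, num_tokens), True where a patch was masked
    """
    s = l2_normalize(student_tokens)
    t = l2_normalize(teacher_tokens)
    cos = np.sum(s * t, axis=-1)          # cosine similarity per token
    per_token = 1.0 - cos                 # 0 when perfectly aligned
    num_masked = max(int(mask.sum()), 1)  # avoid division by zero
    return float(np.sum(per_token * mask) / num_masked)


# Hypothetical shapes and stand-in data (assumptions, not from the paper).
rng = np.random.default_rng(0)
batch, num_tokens, dim = 2, 196, 512
teacher = rng.normal(size=(batch, num_tokens, dim))  # stand-in for frozen teacher features
student = rng.normal(size=(batch, num_tokens, dim))  # stand-in for the model being pre-trained
mask = rng.random((batch, num_tokens)) < 0.4         # assumed mask ratio

loss_random = alignment_loss(student, teacher, mask)   # unaligned features
loss_perfect = alignment_loss(teacher, teacher, mask)  # identical features
```

In this sketch the loss is close to zero when the student reproduces the teacher's token features and close to one for unrelated random features, which is the signal a pre-training objective of this kind would minimize.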

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training