MVP: Multimodality-guided Visual Pre-training
Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian

TL;DR
This paper introduces MVP, a multimodality-guided visual pre-training method that leverages vision-language models like CLIP to enhance masked image modeling, resulting in significant improvements in visual recognition tasks.
Contribution
MVP is the first to incorporate multimodal guidance from vision-language models into masked image modeling for visual pre-training.
Findings
MVP achieves 52.4% mIoU on ADE20K, surpassing BEIT, the previous state of the art.
Pre-training ViT-Base/16 on ImageNet for 300 epochs with MVP improves fine-tuning performance on downstream visual recognition tasks.
Multimodal guidance significantly boosts the effectiveness of visual pre-training.
Abstract
Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representation by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training the ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition…
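The abstract describes MIM as aligning token-level features with a pre-defined space, which in MVP comes from the vision branch of CLIP. The snippet below is a minimal sketch of what such a token-level alignment objective could look like. It is an illustration under stated assumptions, not the authors' implementation: the cosine-distance form of the loss, the function names `cosine_similarity` and `token_alignment_loss`, and the toy feature vectors are all hypothetical, and no real CLIP model is loaded.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def token_alignment_loss(
    student_tokens: Sequence[Vector],
    teacher_tokens: Sequence[Vector],
) -> float:
    """Mean (1 - cosine similarity) over corresponding token features.

    Assumption: `teacher_tokens` stand in for per-token features from a
    frozen guidance model (for MVP, the CLIP vision branch), and
    `student_tokens` for the features predicted by the model being
    pre-trained. The loss form itself is hypothetical.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("need the same non-zero number of tokens on both sides")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    )
    return total / len(student_tokens)


if __name__ == "__main__":
    # Toy placeholder features: two tokens, two dimensions each.
    teacher = [[2.0, 0.0], [0.0, 1.0]]
    aligned_student = [[1.0, 0.0], [0.0, 2.0]]
    misaligned_student = [[0.0, 1.0], [1.0, 0.0]]
    print(token_alignment_loss(aligned_student, teacher))
    print(token_alignment_loss(misaligned_student, teacher))
```

In this sketch the loss depends only on the direction of each token feature, so a student whose tokens point the same way as the teacher's incurs no penalty regardless of scale. Whether MVP uses this particular form is not stated in the text above.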
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
