UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

Yanan Sun; Zihan Zhong; Qi Fan; Chi-Keung Tang; Yu-Wing Tai

arXiv:2306.04715·cs.CV·June 9, 2023·1 cites

Open Access

TL;DR

This paper introduces UniBoost, a method that uses large-scale unsupervised unimodal pre-training to significantly improve zero-shot vision-language task performance, surpassing existing multimodal models like CLIP.

Contribution

The paper demonstrates that unimodal pre-training enhances zero-shot vision-language understanding, offering broader data coverage and reducing misalignment issues compared to joint multimodal training.

Findings

01. Unimodal pre-training outperforms CLIP-based models by 6.5% on PASCAL-5i zero-shot segmentation.

02. Unimodal pre-training improves zero-shot segmentation on COCO-20i by 6.2% over CLIP-based models.

03. Models initialized from unimodal pre-training learn rich representations of both images and text, which boosts zero-shot performance.

Abstract

Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting their ability to cover a large distribution of real-world data, and noise can also be introduced in the form of misaligned pairs during pre-processing. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement of simultaneous presence of image and text. In this paper, we demonstrate that using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our thorough studies validate that models pre-trained as such can learn rich representations of both modalities,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training