UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks
Yanan Sun, Zihan Zhong, Qi Fan, Chi-Keung Tang, Yu-Wing Tai

TL;DR
This paper introduces UniBoost, a method that uses large-scale unsupervised unimodal pre-training to significantly improve zero-shot vision-language task performance, surpassing existing multimodal models like CLIP.
Contribution
The paper demonstrates that unimodal pre-training enhances zero-shot vision-language understanding, offering broader data coverage and reducing misalignment issues compared to joint multimodal training.
Findings
Unimodal pre-training outperforms CLIP-based models by 6.5% on PASCAL-5i.
Unimodal pre-training improves COCO-20i segmentation by 6.2%.
Models learn richer representations of images and text, boosting zero-shot performance.
Abstract
Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting their ability to cover a large distribution of real-world data, where noise can also be introduced as misaligned pairs during pre-processing. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement of simultaneous presence of image and text. In this paper, we demonstrate that using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our thorough studies validate that models pre-trained as such can learn rich representations of both modalities,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
