Is Multimodal Vision Supervision Beneficial to Language?
Avinash Madasu, Vasudev Lal

TL;DR
This paper investigates whether vision supervision enhances language representations in multimodal models, finding that vanilla text encoders often outperform vision-supervised ones on language understanding and reasoning tasks.
Contribution
The paper provides a comparative analysis showing that vision supervision does not necessarily improve language representations in current multimodal models.
Findings
Vanilla language representations outperform vision-supervised ones on most benchmarks.
Vision supervision does not significantly enhance language understanding in current models.
The study highlights limitations of current vision-language pre-training approaches.
Abstract
Vision (image and video)-Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multimodal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the supervision provided by the complementary modality. In this paper, we explore whether language representations trained using vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models such as ALBEF, BLIP, and METER, and video-text models such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with the language representations of the text encoders learnt through vision supervision. Our experiments suggest…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: BLIP: Bootstrapping Language-Image Pre-training · ALBEF
