Is Multimodal Vision Supervision Beneficial to Language?

Avinash Madasu; Vasudev Lal

arXiv:2302.05016·cs.CV·April 18, 2023

TL;DR

This paper investigates whether vision supervision enhances language representations in multimodal models, finding that vanilla text encoders often outperform vision-supervised ones on language understanding and reasoning tasks.

Contribution

It provides a comparative analysis showing that vision supervision does not necessarily improve language representations in current multimodal models.

Findings

1. Vanilla language representations outperform vision-supervised ones on most benchmarks.

2. Vision supervision does not significantly enhance language understanding in current models.

3. The study highlights limitations of current vision-language pre-training approaches.

Abstract

Vision (image and video)-Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multimodal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the supervision provided by the complementary modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models such as ALBEF, BLIP, and METER, and video-text models such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with those of the text encoders learned through vision supervision. Our experiments suggest…
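The comparison described in the abstract (two text encoders scored on the same language benchmarks) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the nearest-centroid probe is a simplification swapped in for whatever training protocol the paper uses, and every name here (`centroid_probe_accuracy`, `compare_encoders`, `toy_encoder_a`, `toy_encoder_b`, the `toy_sentiment` data) is hypothetical and exists only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]
Dataset = List[Tuple[str, int]]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_probe_accuracy(encoder: Encoder, train: Dataset, test: Dataset) -> float:
    """Score a frozen encoder with a nearest-centroid probe (an assumed,
    simplified stand-in for a real benchmark protocol)."""
    # Group the encoder's features by label.
    features: Dict[int, List[Vector]] = {}
    for text, label in train:
        features.setdefault(label, []).append(encoder(text))
    # Average each label's features, dimension by dimension.
    centroids = {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in features.items()
    }
    correct = 0
    for text, label in test:
        vec = encoder(text)
        predicted = min(centroids, key=lambda c: squared_distance(vec, centroids[c]))
        correct += int(predicted == label)
    return correct / len(test)


def compare_encoders(
    encoders: Dict[str, Encoder],
    benchmarks: Dict[str, Tuple[Dataset, Dataset]],
) -> Dict[str, Dict[str, float]]:
    """Return accuracy per benchmark and per encoder on identical data splits."""
    return {
        bench: {name: centroid_probe_accuracy(enc, train, test)
                for name, enc in encoders.items()}
        for bench, (train, test) in benchmarks.items()
    }


# Hypothetical toy encoders; real use would wrap the two text encoders
# being compared (stand-alone vs. vision-supervised).
def toy_encoder_a(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count("good")), float(words.count("bad"))]


def toy_encoder_b(text: str) -> Vector:
    return [float(len(text))]


# Placeholder data, not taken from any benchmark in the paper.
toy_train: Dataset = [("good movie", 1), ("bad movie", 0)]
toy_test: Dataset = [("good plot", 1), ("bad plot", 0)]

results = compare_encoders(
    {"encoder_a": toy_encoder_a, "encoder_b": toy_encoder_b},
    {"toy_sentiment": (toy_train, toy_test)},
)
print(results)
```

Keeping the data splits and the probe fixed while swapping only the encoder is the point of the design: any difference in score is then attributable to the representations rather than to the evaluation setup. Loading the actual models is outside the scope of this sketch.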

Code & Models

Repositories

avinashsai/mml (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: BLIP (Bootstrapping Language-Image Pre-training) · ALBEF