Training Vision-Language Models with Less Bimodal Supervision
Elad Segal, Ben Bogin, Jonathan Berant

TL;DR
This paper explores reducing the reliance on aligned image-text pairs for training vision-language models, showing that for simple tasks, bimodal data can be eliminated with minimal performance loss, but complex reasoning tasks still require some bimodal supervision.
Contribution
The paper demonstrates that vision-language models can be pretrained with significantly less bimodal supervision, especially for simpler tasks, by using independent modality pretraining and weak supervision techniques.
Findings
Eliminating bimodal supervision causes minor performance loss on simple tasks.
Using only 5% of bimodal data results in moderate performance degradation.
Complex reasoning tasks still require some bimodal supervision for good performance.
Abstract
Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on such parallel data, which we term "bimodal supervision", and use models that are pretrained on each modality independently. We experiment with a high-performing vision-language model, and analyze the effect of bimodal supervision on three vision-language tasks. We find that on simpler tasks, such as VQAv2 and GQA, one can eliminate bimodal supervision completely, suffering only a minor loss in performance. Conversely, for NLVR2, which requires more complex reasoning, training without…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
