Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
Taichi Iki, Akiko Aizawa

TL;DR
This paper evaluates five vision-and-language (V&L) models, both single-stream and dual-stream, on the GLUE natural language understanding (NLU) benchmark. Dual-stream models score about the same as single-stream models, and V&L pre-training generally lowers NLU performance.
Contribution
It provides a comparative analysis of single-stream and dual-stream V&L models on the GLUE NLU benchmark, showing how architecture and V&L pre-training affect the language understanding inherited from the original language model.
Findings
Dual-stream models score about the same as single-stream models on NLU tasks (GLUE), contrary to expectation.
V&L pre-training causes a performance drop in NLU tasks, with few exceptions.
Single-stream models may better maintain language understanding capabilities.
Abstract
A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make a V&L model inherit the capability of natural language understanding (NLU) from the original language model. To see how well this is achieved, we propose to evaluate V&L models using an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that the dual-stream scores are not much different than the single-stream scores, contrary to expectation. Further analysis shows that pre-training causes the performance drop in NLU tasks with few exceptions.…
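The comparison described in the abstract can be pictured as a per-task score difference between the original language model and its V&L extension. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function name `glue_drop` and the scores are hypothetical placeholders chosen for illustration and are not results from the paper.

```python
from statistics import mean


def glue_drop(base_scores, vl_scores):
    """Return per-task score change (V&L model minus base language model)
    and the macro-average change over the tasks both models were scored on."""
    shared = sorted(set(base_scores) & set(vl_scores))
    if not shared:
        raise ValueError("no tasks in common")
    per_task = {task: vl_scores[task] - base_scores[task] for task in shared}
    return per_task, mean(per_task.values())


# Placeholder numbers for illustration only; NOT results from the paper.
base = {"CoLA": 60.0, "SST-2": 92.0, "RTE": 70.0}
vl = {"CoLA": 55.0, "SST-2": 91.0, "RTE": 67.0}

per_task, avg = glue_drop(base, vl)
print(per_task)  # negative values mean the V&L model scored lower
print(avg)
```

A negative average would correspond to the kind of NLU performance drop the abstract attributes to V&L pre-training; running the same comparison for a single-stream and a dual-stream model would show whether the two architectures differ.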
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
