Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

Taichi Iki; Akiko Aizawa

arXiv:2104.08066 · cs.CL · September 24, 2021

TL;DR

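This paper evaluates single-stream and dual-stream vision-and-language (V&L) models on an NLU benchmark (GLUE). Dual-stream models, despite roughly twice the parameters, score about the same as single-stream models, so a single-stream design may be the better choice for preserving language capabilities after pre-training.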

Contribution

It provides a comparative analysis of single-stream and dual-stream V&L models on NLU tasks, highlighting the impact of architecture and pre-training on language understanding.

Findings

01

Dual-stream models score about the same as single-stream models on NLU tasks, despite having approximately twice as many parameters.

02

V&L pre-training lowers performance on NLU tasks, with few exceptions.

03

Single-stream models may better maintain language understanding capabilities.

Abstract

A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make a V&L model inherit the capability of natural language understanding (NLU) from the original language model. To see how well this is achieved, we propose to evaluate V&L models using an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that the dual-stream scores are not much different than the single-stream scores, contrary to expectation. Further analysis shows that pre-training causes the performance drop in NLU tasks with few exceptions.…

Code & Models

Repositories

alab-nii/eval_vl_glue
jax · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling