Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

Letitia Parcalabescu; Albert Gatt; Anette Frank; Iacer Calixto

arXiv:2012.12352·cs.CV·June 18, 2021

TL;DR

This study evaluates pretrained vision and language (V&L) models on two tasks that require multimodal integration: image-sentence discrimination and counting. The models discriminate correct image-sentence pairs well, but they count poorly and do not generalise to out-of-distribution quantities.

Contribution

The paper assesses the reasoning abilities of three pretrained V&L models (ViLBERT, ViLBERT 12-in-1 and LXMERT) in zero-shot and finetuned settings, with a focus on counting, and discusses possible causes of their limitations, such as dataset bias and catastrophic forgetting.

Findings

01. Models excel at discriminating correct image-sentence pairs.

02. Pretrained V&L models struggle with counting and generalizing to new quantities.

03. Dataset bias and entity individuation issues impact model performance.

Abstract

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are…
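Task (1) can be scored as a pairwise comparison: a model succeeds on an example when it rates the correct image-sentence pair above the incorrect one. The sketch below is a minimal illustration of that metric only, not the authors' evaluation code. The function name `pairwise_accuracy` and the score lists are hypothetical; in practice the scores would come from a model's image-sentence alignment output.

```python
from typing import Sequence


def pairwise_accuracy(correct_scores: Sequence[float],
                      foil_scores: Sequence[float]) -> float:
    """Fraction of examples where the correct pair outscores the incorrect one.

    correct_scores[i] and foil_scores[i] are hypothetical alignment scores
    for the same image paired with a correct and an incorrect sentence.
    """
    if len(correct_scores) != len(foil_scores):
        raise ValueError("score lists must have the same length")
    if len(correct_scores) == 0:
        raise ValueError("need at least one example")
    # Ties count as failures: the model must prefer the correct pair.
    wins = sum(c > f for c, f in zip(correct_scores, foil_scores))
    return wins / len(correct_scores)


if __name__ == "__main__":
    # Made-up scores for three examples, for illustration only.
    print(pairwise_accuracy([0.9, 0.4, 0.7], [0.1, 0.6, 0.2]))
```

Counting ties as failures is a design choice made here for strictness; a different convention would change the reported number.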

Videos

How cross-modal are vision and language models really? 👀 Seeing past words. [Own work] · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT) · Vision-and-Language BERT (ViLBERT)