Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Stella Frank; Emanuele Bugliarello; Desmond Elliott

arXiv:2109.04448·cs.CL·September 10, 2021

TL;DR

This paper introduces a diagnostic method for testing whether multimodal transformers genuinely integrate visual and language information: inputs from one modality are ablated and prediction performance is measured on the other. The results reveal an asymmetry in how the models use cross-modal information.

Contribution

The paper proposes a novel input ablation technique to assess cross-modal integration in multimodal transformers, highlighting their asymmetrical reliance on modalities.

Findings

01

Predicting text becomes much harder, in relative terms, when visual information is ablated.

02

Predicting visual object categories is affected far less when text input is ablated.

03

Cross-modal integration is asymmetrical: the models rely on visual input when predicting text more than they rely on text when predicting visual object categories.

Abstract

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated,…
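The ablation procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[MASK]` placeholder, the condition names (`none`, `aligned`, `all`), the helper functions, and especially `toy_loss` (a stand-in for a real model's masked-prediction loss) are all hypothetical.

```python
from typing import Callable, Dict, List, Set

MASK = "[MASK]"  # hypothetical placeholder for an ablated input


def ablate(inputs: List[str], indices: Set[int]) -> List[str]:
    """Replace the inputs at the given positions with the placeholder."""
    return [MASK if i in indices else x for i, x in enumerate(inputs)]


def ablation_losses(
    loss_fn: Callable[[List[str], List[str]], float],
    target: List[str],
    other: List[str],
    aligned: Set[int],
) -> Dict[str, float]:
    """Loss on `target` while the `other` modality is left intact,
    ablated only at grounding-aligned positions, or ablated entirely."""
    everything = set(range(len(other)))
    return {
        "none": loss_fn(target, other),
        "aligned": loss_fn(target, ablate(other, aligned)),
        "all": loss_fn(target, ablate(other, everything)),
    }


def relative_increase(losses: Dict[str, float]) -> Dict[str, float]:
    """Loss increase of each ablated condition relative to no ablation."""
    base = losses["none"]
    return {k: (v - base) / base for k, v in losses.items() if k != "none"}


def toy_loss(target: List[str], other: List[str]) -> float:
    """Assumed stand-in for a model: loss grows with each ablated input."""
    ablated = sum(1 for x in other if x == MASK)
    return 1.0 + 0.5 * ablated


# Predict text while ablating image regions; region 0 is assumed to be
# aligned with the text token being predicted.
text = ["a", "dog", "chases", "a", "ball"]
regions = ["dog", "ball", "grass"]
losses = ablation_losses(toy_loss, text, regions, aligned={0})
increase = relative_increase(losses)
```

Swapping the roles of `text` and `regions` gives the opposite direction (predicting visual targets with text ablated). Comparing the relative increases in the two directions is the kind of symmetry check the abstract describes; a model that integrates both modalities would be expected to show a clear increase in both.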

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning