Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

Mitja Nikolaus; Emmanuelle Salin; Stephane Ayache; Abdellah Fourtassi; Benoit Favre

arXiv:2210.12079·cs.CL·October 24, 2022

Open Access · 1 Repo

TL;DR

This paper introduces a new multimodal task to evaluate whether vision-and-language Transformers understand predicate-noun dependencies, revealing variability in model performance and emphasizing the importance of data quality and fine-grained pretraining.

Contribution

It presents a novel controlled evaluation task for predicate-noun dependencies and analyzes factors influencing model performance, such as pretraining data quality and objectives.

Findings

01 Performance varies significantly across models.
02 Data quality impacts model understanding.
03 Fine-grained pretraining improves results.

Abstract

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word-level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer…
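To make the contrast between word-level meaning and dependency tracking concrete, the following is a minimal sketch of a forced-choice evaluation. It is not the authors' implementation. The trial format, the example sentences, and both toy scorers are hypothetical and chosen only for illustration. It assumes each trial pairs one image with a matching sentence and a distractor that differs in a single word, and that a model is counted correct when it scores the matching sentence strictly higher.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trial:
    """One hypothetical forced-choice trial (format assumed, not from the paper)."""
    image_id: str
    target: str      # sentence that matches the image
    distractor: str  # same sentence with one word swapped


# A scorer maps (image_id, sentence) to an image-text match score.
Scorer = Callable[[str, str], float]


def forced_choice_accuracy(trials: Sequence[Trial], score: Scorer) -> float:
    """Fraction of trials where the target scores strictly above the distractor."""
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(
        score(t.image_id, t.target) > score(t.image_id, t.distractor)
        for t in trials
    )
    return correct / len(trials)


# Toy scene description: img1 shows a woman sitting and a man standing.
SCENE_PAIRS = {"img1": {("woman", "sits"), ("man", "stands")}}


def bag_of_words_score(image_id: str, sentence: str) -> float:
    """Counts sentence words present anywhere in the scene, ignoring who does what."""
    scene_words = {word for pair in SCENE_PAIRS[image_id] for word in pair}
    return float(len(scene_words & set(sentence.lower().split())))


def pair_score(image_id: str, sentence: str) -> float:
    """Checks that the (noun, predicate) pair holds in the scene.

    Assumes toy sentences of the form 'a <noun> <predicate>'.
    """
    words = sentence.lower().split()
    return float((words[1], words[2]) in SCENE_PAIRS[image_id])


TRIALS = [Trial("img1", "a woman sits", "a woman stands")]

print(forced_choice_accuracy(TRIALS, bag_of_words_score))
print(forced_choice_accuracy(TRIALS, pair_score))
```

In this toy setup the bag-of-words scorer ties on the two sentences, because every word appears somewhere in the scene, so it never prefers the target. The pair-based scorer separates them because it checks which noun the predicate attaches to.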

Code & Models

Repositories

mitjanikolaus/multimodal-predicate-noun-dependencies
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding