Multimodality and Attention Increase Alignment in Natural Language Prediction Between Humans and Computational Models

Viktor Kewenig; Andrew Lampinen; Samuel A. Nastase; Christopher Edwards; Quitterie Lacome DEstalenx; Akilles Rechardt; Jeremy I Skipper; Gabriella Vigliocco

arXiv:2308.06035·cs.AI·January 3, 2024·2 cites

Open Access

TL;DR

This study demonstrates that multimodal models with attention mechanisms better align with human predictions of upcoming words, especially when visual cues are integrated, highlighting the importance of multimodality and attention in natural language understanding.

Contribution

The paper shows that multimodal models with attention mechanisms replicate human language prediction more closely than unimodal models, emphasizing the role of visual cues and attention in alignment.

Findings

1. Multimodal models align more closely with human predictions than unimodal models.

2. Including attention mechanisms doubles the alignment with human judgments.

3. Visual cues significantly improve model-human alignment in language prediction.

Abstract

The potential of multimodal generative artificial intelligence (mAI) to replicate human grounded language understanding, including the pragmatic, context-rich aspects of communication, remains to be clarified. Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words. Correspondingly, multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities. To test whether these processes align, we tasked both human participants (N = 200) and several state-of-the-art computational models with evaluating the predictability of forthcoming words after viewing short audio-only or audio-visual clips with speech. During the task, the model's attention weights were recorded and human attention was indexed via eye tracking. Results show that predictability…

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media

Methods: ALIGN · Contrastive Language-Image Pre-training