Understanding Attention for Vision-and-Language Tasks

Feiqi Cao; Soyeon Caren Han; Siqu Long; Changwei Xu; Josiah Poon

arXiv:2208.08104 · cs.CV · September 23, 2022

TL;DR

This paper provides a comprehensive analysis of how different attention score calculation methods affect the interpretability and performance of vision-and-language models across various tasks, highlighting the importance of attention alignment choices.

Contribution

This paper is the first study to systematically examine the impact of attention alignment calculation methods on model interpretability and performance in vision-and-language (VL) tasks.

Findings

01

Attention score calculation methods influence interpretability.

02

The choice of method affects model performance to differing degrees.

03

The analysis spans multiple VL tasks.

Abstract

The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. Although attention is widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by examining attention score calculation methods and checking how well they represent the significance of each visual region and textual token for the global assessment. We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable, and which may affect model performance on three different VL tasks, including visual question answering, text-to-image generation,…
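The abstract centres on how the attention score (alignment) function between a textual token and the visual regions determines the weights that are later read as each region's significance. As a hedged illustration, the sketch below implements four standard alignment functions from the general attention literature: dot-product, scaled dot-product, bilinear ("general"), and additive. This summary does not list which functions the paper actually compares, so the selection, the function names (`attend`, `make_general_score`, and so on), and the toy features are assumptions for illustration only, not the authors' code. The official implementation is in the adlnlp/attention_vl repository.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
ScoreFn = Callable[[Vector, Vector], float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Matrix-vector product m @ v, with m given as a list of rows."""
    return [dot(row, v) for row in m]


def dot_score(q: Vector, k: Vector) -> float:
    """Dot-product alignment: q . k"""
    return dot(q, k)


def scaled_dot_score(q: Vector, k: Vector) -> float:
    """Scaled dot-product alignment: q . k / sqrt(d)."""
    return dot(q, k) / math.sqrt(len(q))


def make_general_score(w: Matrix) -> ScoreFn:
    """Bilinear ("general") alignment q^T W k; W would normally be learned (hypothetical here)."""
    def score(q: Vector, k: Vector) -> float:
        return dot(q, matvec(w, k))
    return score


def make_additive_score(w_q: Matrix, w_k: Matrix, v: Vector) -> ScoreFn:
    """Additive alignment v^T tanh(W_q q + W_k k); weights are hypothetical stand-ins for learned ones."""
    def score(q: Vector, k: Vector) -> float:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w_q, q), matvec(w_k, k))]
        return dot(v, hidden)
    return score


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, regions: Sequence[Vector],
           score_fn: ScoreFn) -> Tuple[List[float], List[float]]:
    """Attend from one text-token query over visual-region features.

    Returns (weights, context): the softmax-normalised weights, often read as
    each region's significance, and the weighted sum of the region features.
    """
    weights = softmax([score_fn(query, r) for r in regions])
    dim = len(regions[0])
    context = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Toy, hypothetical features: one text token and three image regions.
    query = [1.0, 0.0, 1.0, 0.0]
    regions = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    scorers = {
        "dot": dot_score,
        "scaled_dot": scaled_dot_score,
        "general": make_general_score(identity),
        "additive": make_additive_score(identity, identity, [1.0, 1.0, 1.0, 1.0]),
    }
    for name, fn in scorers.items():
        weights, _ = attend(query, regions, fn)
        print(name, [round(w, 3) for w in weights])
```

With these toy inputs, the dot-product family puts the most weight on the first region, scaling by sqrt(d) flattens that distribution, and the additive scorer (with its arbitrary identity weights) ranks the second region highest. This is a small illustration of why the choice of alignment function can change which regions appear significant. It is not a reproduction of any result in the paper.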

Code & Models

Repositories

adlnlp/attention_vl
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning