VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models

Harshit; Tolga Tasdizen

arXiv:2410.04609·cs.CV·October 8, 2024

TL;DR

This paper introduces a new dataset that aligns human visual attention with image-text pairs to analyze and interpret the decision-making processes of vision-language models, enhancing transparency and trust.

Contribution

It provides an image-text aligned human visual attention dataset and a methodology to compare model heatmaps with human attention for better interpretability.

Findings

1. Models' attention aligns variably with human attention
2. Analysis reveals strengths and weaknesses in model interpretability
3. Dataset enables systematic evaluation of visual-text associations

Abstract

Recent developments in deep learning have led to the integration of natural language processing (NLP) with computer vision, resulting in powerful Vision and Language Models (VLMs). Despite their remarkable capabilities, these models are frequently regarded as black boxes within the machine learning research community. This raises a critical question: which parts of an image correspond to specific segments of text, and how can we decipher these associations? Understanding these connections is essential for enhancing model transparency, interpretability, and trustworthiness. To answer this question, we present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments. We then compare the internal heatmaps generated by VLMs with this dataset, allowing us to analyze and better understand the…
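The comparison between a model heatmap and a human attention map can be illustrated with a short sketch. The excerpt above does not say which metrics the authors use, so the two scores below (Pearson correlation and histogram-intersection similarity, both common in saliency evaluation) are assumptions, and the function names `normalize_map`, `pearson_cc`, and `similarity` are hypothetical. Both maps are assumed to be 2D arrays already resized to the same shape.

```python
import numpy as np


def normalize_map(heatmap):
    """Shift a map to be non-negative and scale it to sum to 1."""
    h = np.asarray(heatmap, dtype=np.float64)
    h = h - h.min()
    total = h.sum()
    if total == 0:
        # Constant map: fall back to a uniform distribution.
        return np.full(h.shape, 1.0 / h.size)
    return h / total


def pearson_cc(model_map, human_map):
    """Pearson correlation between two maps (1 = identical pattern)."""
    m = np.asarray(model_map, dtype=np.float64).ravel()
    h = np.asarray(human_map, dtype=np.float64).ravel()
    m = m - m.mean()
    h = h - h.mean()
    denom = np.sqrt((m ** 2).sum() * (h ** 2).sum())
    if denom == 0:
        # At least one map is constant, so correlation is undefined.
        return 0.0
    return float((m * h).sum() / denom)


def similarity(model_map, human_map):
    """Histogram intersection of the two normalized maps, in [0, 1]."""
    p = normalize_map(model_map)
    q = normalize_map(human_map)
    return float(np.minimum(p, q).sum())


if __name__ == "__main__":
    # Toy 2x2 maps standing in for a model heatmap and a human attention map.
    model = np.array([[0.9, 0.1], [0.0, 0.0]])
    human = np.array([[0.8, 0.2], [0.0, 0.0]])
    print("CC :", pearson_cc(model, human))
    print("SIM:", similarity(model, human))
```

Higher values of either score mean the model attends to regions closer to where people looked; computing them per image-text pair gives one way to summarize how alignment varies across the dataset.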

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · ALIGN · Focus