VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit, Tolga Tasdizen

TL;DR
This paper introduces a dataset that aligns human visual attention with image-text pairs. The dataset is used to analyze and interpret how vision-language models reach their decisions, with the aim of improving transparency and trust.
Contribution
The paper provides an image-text aligned human visual attention dataset and a methodology for comparing model heatmaps with human attention to improve interpretability.
Findings
Models' attention aligns variably with human attention
Analysis reveals strengths and weaknesses in model interpretability
Dataset enables systematic evaluation of visual-text associations
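As an illustration of the heatmap comparison described above, here is a minimal sketch in Python. This summary does not say which alignment metrics the paper uses, so the two measures below (Pearson correlation and histogram intersection, both common in saliency evaluation) are assumptions, and the function names are hypothetical. The sketch also assumes the model heatmap and the human attention map have already been resized to the same shape.

```python
import numpy as np


def normalize_map(heatmap):
    """Shift a heatmap to be non-negative and scale it to sum to 1."""
    h = np.asarray(heatmap, dtype=np.float64)
    h = h - h.min()
    total = h.sum()
    if total == 0:
        # A constant map carries no preference; treat it as uniform.
        return np.full(h.shape, 1.0 / h.size)
    return h / total


def pearson_cc(model_map, human_map):
    """Pearson correlation between two same-shaped heatmaps (range -1 to 1)."""
    a = np.asarray(model_map, dtype=np.float64).ravel()
    b = np.asarray(human_map, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def histogram_intersection(model_map, human_map):
    """Overlap of the two normalized maps (0 = disjoint, 1 = identical)."""
    p = normalize_map(model_map)
    q = normalize_map(human_map)
    return float(np.minimum(p, q).sum())


if __name__ == "__main__":
    # Toy 2x2 maps, illustrative values only (not data from the paper).
    human = np.array([[1.0, 0.0], [0.0, 0.0]])
    model_same = human.copy()
    model_off = np.array([[0.0, 0.0], [0.0, 1.0]])
    print(pearson_cc(model_same, human), histogram_intersection(model_same, human))
    print(pearson_cc(model_off, human), histogram_intersection(model_off, human))
```

A higher score on either measure would indicate closer agreement between where the model and the human annotators attend for a given text segment. Reporting more than one measure is a common choice because correlation and overlap respond differently to diffuse versus sharply peaked maps.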
Abstract
Recent developments in deep learning have led to the integration of natural language processing (NLP) with computer vision, resulting in powerful integrated Vision and Language Models (VLMs). Despite their remarkable capabilities, these models are frequently regarded as black boxes within the machine learning research community. This raises a critical question: which parts of an image correspond to specific segments of text, and how can we decipher these associations? Understanding these connections is essential for enhancing model transparency, interpretability, and trustworthiness. To answer this question, we present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments. We then compare the internal heatmaps generated by VLMs with this dataset, allowing us to analyze and better understand the…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · ALIGN · Focus
