VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

Ekta Sood; Fabian Kögel; Florian Strohm; Prajit Dhar; Andreas Bulling

arXiv:2109.13116·cs.CV·March 4, 2026

TL;DR

This paper introduces VQA-MHUG, a 49-participant dataset of human gaze on both images and questions during visual question answering, and compares the attention strategies of five VQA models with those of humans. For all models, higher correlation with human attention on text is a significant predictor of VQA performance.

Contribution

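The paper presents a novel multimodal gaze dataset for VQA covering both images and questions, and uses it to compare human and neural attention strategies. Whereas prior work focused on the image modality, it shows for the first time that agreement with human attention on text is linked to VQA performance.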

Findings

01 Higher correlation with human text attention predicts better VQA performance

02 Neural models show attention patterns similar to those of humans on the text modality

03 The results suggest that improving neural text attention could enhance VQA models

Abstract

We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show - for the first time - that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points at a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into…
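The abstract describes measuring how well model attention correlates with human attention on the question text. As a rough illustration of that kind of comparison, the sketch below computes a Spearman rank correlation between two per-token attention vectors. The choice of Spearman correlation, the function names, and all attention values are assumptions made for illustration; this page does not state which similarity metric the authors used.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def spearman(human_attention, model_attention):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(human_attention) != len(model_attention):
        raise ValueError("attention vectors must cover the same tokens")
    return pearson(average_ranks(human_attention), average_ranks(model_attention))


# Hypothetical per-token weights for the question "what color is the cat".
# These numbers are invented for illustration, not taken from VQA-MHUG.
human = [0.05, 0.30, 0.05, 0.10, 0.50]
model_a = [0.10, 0.25, 0.05, 0.15, 0.45]  # emphasizes the same tokens as the human
model_b = [0.40, 0.10, 0.30, 0.15, 0.05]  # emphasizes the function words instead

rho_a = spearman(human, model_a)
rho_b = spearman(human, model_b)
print(f"model_a vs human: {rho_a:.3f}")
print(f"model_b vs human: {rho_b:.3f}")
```

In this toy setup, `model_a` ranks the tokens almost as the human does and gets a strongly positive score, while `model_b` gets a strongly negative one. Relating such per-question scores to answer accuracy would be a separate step that is not shown here.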

Taxonomy

Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Domain Adaptation and Few-Shot Learning