Guiding Visual Question Answering with Attention Priors
Thao Minh Le, Vuong Le, Sunil Gupta, Svetha Venkatesh, Truyen Tran

TL;DR
This paper introduces a method to improve visual question answering by guiding attention mechanisms with explicit linguistic-visual grounding, enhancing accuracy, robustness, and interpretability without requiring extensive supervision.
Contribution
It proposes guiding attention in VQA models with linguistic-visual grounding that is learned from question-image pairs alone, so learning the grounding needs no answer annotations or external supervision.
Findings
Improved VQA model performance.
Enhanced robustness with limited data.
Increased interpretability of attention mechanisms.
Abstract
The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as in VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is only guided by a very sparse signal (i.e. the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions…
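The idea of steering attention with a grounding prior can be illustrated with a small sketch. The excerpt above does not say how the prior enters training, so everything here is an assumption made for illustration: the KL-divergence penalty, the `weight` parameter, and the names `guided_loss` and `grounding_prior` are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def guided_loss(answer_loss, attention_logits, grounding_prior, weight=0.5):
    """Answer loss plus a penalty pulling attention toward the prior.

    Assumption: a KL term is one plausible way to inject the prior;
    the paper's actual mechanism may differ.
    """
    attention = softmax(attention_logits)
    return answer_loss + weight * kl_divergence(grounding_prior, attention)

# Hypothetical example: the question word "dog" and three image regions.
prior = [0.8, 0.1, 0.1]              # assumed grounding: region 0 is the dog
aligned_logits = [2.0, 0.0, 0.0]     # attention already favours region 0
misaligned_logits = [0.0, 2.0, 0.0]  # attention favours the wrong region

loss_aligned = guided_loss(1.0, aligned_logits, prior)
loss_misaligned = guided_loss(1.0, misaligned_logits, prior)
print(loss_aligned < loss_misaligned)
```

With the same answer loss, the attention pattern that disagrees with the prior is penalized more. This gives attention a denser training signal than the answer label alone, which is the deviation the abstract sets out to correct.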
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
