Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
Huijuan Xu, Kate Saenko

TL;DR
This paper introduces the Spatial Memory Network, a model with a novel question-guided spatial attention mechanism for visual question answering. The mechanism improves spatial inference and accuracy on the DAQUAR and VQA benchmark datasets.
Contribution
The paper proposes a new Spatial Memory Network with a two-hop attention architecture that explicitly models spatial inference in VQA tasks.
Findings
Improved accuracy on DAQUAR and VQA datasets.
Effective spatial attention alignment between words and image patches.
Visualization of attention weights demonstrates spatial inference capabilities.
Abstract
We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process which constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image…
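The "hop" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each image region and the question are already embedded as vectors of the same length, scores regions by a plain dot product, and combines evidence with the question by simple addition before a second hop. The function names `softmax`, `attention_hop`, and `two_hop` are hypothetical, chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_hop(region_features, query_vec):
    """One hop: weight each spatial region by its relevance to the query.

    region_features: list of per-region feature vectors (the "memory").
    query_vec: embedded question (assumed to share the regions' dimension).
    Returns (attention weights, attention-weighted sum of region features).
    """
    # Relevance score of each region: dot product with the query (an assumption).
    scores = [sum(r * q for r, q in zip(region, query_vec))
              for region in region_features]
    weights = softmax(scores)
    dim = len(query_vec)
    # Evidence vector: regions averaged under the attention weights.
    attended = [sum(w * region[k] for w, region in zip(weights, region_features))
                for k in range(dim)]
    return weights, attended


def two_hop(region_features, question_vec):
    """Two hops: the second query adds first-hop evidence to the question.

    The additive combination is an illustrative assumption.
    """
    _, evidence = attention_hop(region_features, question_vec)
    second_query = [e + q for e, q in zip(evidence, question_vec)]
    return attention_hop(region_features, second_query)


# Toy memory: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]

weights, attended = attention_hop(regions, question)
weights2, attended2 = two_hop(regions, question)
```

In this toy input the first region aligns best with the question vector, so it receives the largest attention weight in both hops. A real model would learn the embeddings and feed the final evidence vector to an answer classifier.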
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Memory Network
