Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Huijuan Xu; Kate Saenko

arXiv:1511.05234·cs.CV·March 22, 2016·105 cites

TL;DR

This paper introduces a Spatial Memory Network with a novel spatial attention mechanism for visual question answering, improving the model's ability to perform spatial inference and achieve better accuracy on benchmark datasets.

Contribution

The paper proposes a new Spatial Memory Network with a two-hop attention architecture that explicitly models spatial inference in VQA tasks.

Findings

01. Improved accuracy on DAQUAR and VQA datasets.

02. Effective spatial attention alignment between words and image patches.

03. Visualization of attention weights demonstrates spatial inference capabilities.

Abstract

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process that constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image…
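The "hop" described in the abstract can be illustrated with a minimal NumPy sketch: region features sit in memory, a question vector scores each region, and a softmax over those scores weights the evidence that is pooled for the answer. This is not the authors' implementation. The function name `attention_hop`, the two projection matrices, the residual addition of the query, the 7x7 grid, and all dimensions are assumptions chosen for illustration, and random arrays stand in for real CNN activations and question embeddings.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_hop(region_feats, query, w_att, w_evd):
    """One question-guided attention hop (illustrative sketch).

    region_feats: (R, C) activations, one row per spatial region.
    query:        (D,) question representation.
    w_att, w_evd: (C, D) hypothetical projections for scoring and evidence.
    """
    keys = region_feats @ w_att        # (R, D) embedding used for scoring
    values = region_feats @ w_evd      # (R, D) embedding used as evidence
    weights = softmax(keys @ query)    # (R,) attention over regions
    evidence = weights @ values        # (D,) weighted sum of region evidence
    # Assumption: combine the pooled evidence with the query by addition.
    return weights, evidence + query


rng = np.random.default_rng(0)
num_regions, feat_dim, embed_dim = 49, 512, 64  # e.g. a 7x7 grid (assumed)

region_feats = rng.standard_normal((num_regions, feat_dim))
question = rng.standard_normal(embed_dim)

# Separate hypothetical parameters for each hop.
w_att1 = rng.standard_normal((feat_dim, embed_dim)) * 0.01
w_evd1 = rng.standard_normal((feat_dim, embed_dim)) * 0.01
w_att2 = rng.standard_normal((feat_dim, embed_dim)) * 0.01
w_evd2 = rng.standard_normal((feat_dim, embed_dim)) * 0.01

# First hop is guided by the question; the second hop is guided by the
# output of the first, so attention can be refined.
weights1, hop1 = attention_hop(region_feats, question, w_att1, w_evd1)
weights2, hop2 = attention_hop(region_feats, hop1, w_att2, w_evd2)
```

Reshaping `weights1` or `weights2` to `(7, 7)` gives a map over the assumed grid, which is the kind of attention visualization the findings refer to.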

Code & Models

Repositories

zixuwang1996/VQA-reading-list

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Memory Network