ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

Kan Chen; Jiang Wang; Liang-Chieh Chen; Haoyuan Gao; Wei Xu; Ram Nevatia

arXiv:1511.05960 · cs.CV · April 5, 2016 · 278 cites

Open Access

TL;DR

This paper introduces ABC-CNN, a novel attention-based deep learning model for visual question answering that improves accuracy by focusing on relevant image regions guided by the question.

Contribution

The paper presents a new attention mechanism within CNNs that dynamically generates question-guided attention maps for VQA tasks.

Findings

01. Achieves significant improvements over state-of-the-art methods on three benchmark datasets: Toronto COCO-QA, DAQUAR, and VQA.

02. Question-guided attention maps highlight the image regions relevant to each question.

03. Demonstrates the effectiveness of attention in improving VQA accuracy.

Abstract

We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and an image-related natural language question, VQA generates a natural language answer to the question. Generating the correct answer requires the model's attention to focus on the regions corresponding to the question, because different questions inquire about the attributes of different image regions. We introduce an attention-based configurable convolutional neural network (ABC-CNN) to learn such question-guided attention. ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and the VQA dataset. The ABC-CNN model achieves significant improvements over…
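
The core idea in the abstract, convolving the image feature map with a kernel derived from the question, can be illustrated with a minimal sketch. Everything beyond that one sentence is an assumption made for illustration: the linear projection with a sigmoid that produces the kernel, the zero padding, the spatial softmax, the random inputs, and the names question_to_kernel and attention_map are hypothetical and are not taken from the paper. Plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def question_to_kernel(q_emb, weights, channels, ksize):
    """Map a question embedding to a (channels, ksize, ksize) kernel.

    Assumption: a linear projection followed by a sigmoid; the abstract only
    says the kernel is "derived from the question's semantics".
    """
    flat = []
    for row in weights:  # one weight row per kernel entry
        s = sum(w * q for w, q in zip(row, q_emb))
        flat.append(1.0 / (1.0 + math.exp(-s)))
    kernel, idx = [], 0
    for _ in range(channels):
        plane = []
        for _ in range(ksize):
            plane.append(flat[idx:idx + ksize])
            idx += ksize
        kernel.append(plane)
    return kernel


def attention_map(features, kernel):
    """Slide the kernel over a (C, H, W) feature map and normalise the scores.

    Uses cross-correlation (the usual deep learning "convolution") with zero
    padding so the output keeps the H x W shape; ksize is assumed to be odd.
    The spatial softmax is an assumption, chosen so the map sums to 1.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    ksize = len(kernel[0])
    pad = ksize // 2
    scores = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for c in range(channels):
                for i in range(ksize):
                    for j in range(ksize):
                        yy, xx = y + i - pad, x + j - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            total += features[c][yy][xx] * kernel[c][i][j]
            scores[y][x] = total
    peak = max(max(row) for row in scores)  # subtract the max for stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    norm = sum(sum(row) for row in exps)
    return [[v / norm for v in row] for row in exps]


# Toy inputs standing in for CNN image features and a question embedding.
random.seed(0)
C, H, W, K, D = 2, 4, 4, 3, 5
features = [[[random.random() for _ in range(W)] for _ in range(H)]
            for _ in range(C)]
q_emb = [random.uniform(-1, 1) for _ in range(D)]
weights = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C * K * K)]

kernel = question_to_kernel(q_emb, weights, C, K)
att = attention_map(features, kernel)
```

Because the kernel depends on q_emb, a different question yields a different attention map over the same image features, which is the question-guided behaviour the abstract describes. In a trained model the projection weights would be learned rather than random.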

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning