Hierarchical Question-Image Co-Attention for Visual Question Answering

Jiasen Lu; Jianwei Yang; Dhruv Batra; Devi Parikh

arXiv:1606.00061 · cs.CV · January 20, 2017 · 1.2k cites

TL;DR

This paper introduces a hierarchical co-attention model for VQA that jointly models image and question attention, improving accuracy on the VQA and COCO-QA datasets.

Contribution

The paper proposes a co-attention mechanism that jointly attends to image regions and question words, and reasons about the question hierarchically with a 1-dimensional CNN, improving on the prior state of the art in visual question answering.
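
One way to picture CNN-based hierarchical reasoning over a question is a 1-dimensional convolution over word embeddings at several window sizes (unigram, bigram, trigram), followed by a max across window sizes at each word position. The sketch below illustrates that idea only. The function name `phrase_features`, the filter shapes, the zero padding, and the random inputs are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def phrase_features(words, filters):
    """Hypothetical phrase-level features from word embeddings.

    words:   (T, d) array, one embedding per question word.
    filters: dict mapping window size s to a weight matrix of shape (d, s * d).
    Returns a (T, d) array: for each word position, the element-wise max
    over the n-gram responses of every window size.
    """
    T, d = words.shape
    per_size = []
    for s, W in filters.items():
        # Zero-pad at the end so every window size yields T outputs (assumption).
        padded = np.vstack([words, np.zeros((s - 1, d))])
        # Slide a window of s words and apply the filter, then tanh.
        out = np.stack(
            [np.tanh(W @ padded[t:t + s].reshape(-1)) for t in range(T)]
        )
        per_size.append(out)                     # (T, d)
    # Max across window sizes at each position and feature.
    return np.max(np.stack(per_size), axis=0)    # (T, d)

rng = np.random.default_rng(0)
T, d = 6, 4                                      # illustrative sizes
words = rng.standard_normal((T, d))
filters = {s: rng.standard_normal((d, s * d)) * 0.1 for s in (1, 2, 3)}
phrases = phrase_features(words, filters)
print(phrases.shape)
```

The output keeps one vector per word position, so word-level and phrase-level features line up and can both feed an attention step.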

Findings

01. Set a new state of the art on the VQA dataset (60.3% to 60.5%) and on the COCO-QA dataset (61.6% to 63.3%).

02. Demonstrated the effectiveness of hierarchical question-image co-attention.

03. With ResNet features, accuracy improved further to 62.1% on VQA and 65.4% on COCO-QA.

Abstract

A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
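
The abstract describes co-attention in words only. As a rough illustration, the sketch below computes image attention and question attention jointly from a shared affinity matrix between question and image features, which is one common way to formulate parallel co-attention. The function name `parallel_coattention`, the weight names, the dimensions, and the random inputs are assumptions chosen for the example, not values or code from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """Hypothetical parallel co-attention step.

    V: (d, N) image features, one column per image region.
    Q: (d, T) question features, one column per word.
    Returns attention over regions, attention over words, and the
    attended image and question vectors.
    """
    # Affinity between every question word and every image region.
    C = np.tanh(Q.T @ W_b @ V)                    # (T, N)
    # Each modality's hidden state also sees the other through C.
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)        # (k, N)
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)      # (k, T)
    a_v = softmax(w_hv @ H_v)                     # (N,) "where to look"
    a_q = softmax(w_hq @ H_q)                     # (T,) "what words to listen to"
    v_hat = V @ a_v                               # (d,) attended image vector
    q_hat = Q @ a_q                               # (d,) attended question vector
    return a_v, a_q, v_hat, q_hat

rng = np.random.default_rng(0)
d, k, N, T = 8, 6, 5, 4                           # illustrative sizes
V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
W_b = rng.standard_normal((d, d)) * 0.1
W_v = rng.standard_normal((k, d)) * 0.1
W_q = rng.standard_normal((k, d)) * 0.1
w_hv = rng.standard_normal(k)
w_hq = rng.standard_normal(k)

a_v, a_q, v_hat, q_hat = parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq)
print(a_v.shape, a_q.shape, v_hat.shape, q_hat.shape)
```

Both attention vectors are probability distributions, one over image regions and one over question words, and each depends on the other modality through the affinity matrix `C`.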

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Average Pooling · Residual Connection · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling