Deep Modular Co-Attention Networks for Visual Question Answering

Zhou Yu; Jun Yu; Yuhao Cui; Dacheng Tao; Qi Tian

arXiv:1906.10770 · cs.CV · June 27, 2019 · 99 cites

TL;DR

This paper introduces the deep Modular Co-Attention Network (MCAN) for Visual Question Answering, which models question-image interactions through Modular Co-Attention (MCA) layers cascaded in depth and reaches 70.63% accuracy on VQA-v2 test-dev.

Contribution

The paper proposes a deep co-attention architecture built from modular layers, each composed of self-attention and guided-attention units, that outperforms shallow co-attention models and previous state-of-the-art methods on VQA.

Findings

01. MCAN achieves 70.63% accuracy on VQA-v2 test-dev.

02. Deep modular co-attention layers outperform shallow models.

03. Extensive ablation studies validate the effectiveness of the proposed architecture.

Abstract

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and…
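
To make the layer composition described in the abstract concrete, here is a minimal NumPy sketch of one way two attention units could be combined and stacked. It is an illustration under stated assumptions, not the authors' implementation: it uses single-head scaled dot-product attention with a plain residual connection, and omits the learned projections, feed-forward sublayers, and normalization a real model would likely include. All function names, the feature sizes (14 question tokens, 36 image regions, 64 dimensions), and the `depth` value are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a weighted
    # mix of the value rows. Returns the mixed features and the weights.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

def self_attention(x):
    # Self-attention unit (assumed form): features attend to themselves.
    out, _ = attend(x, x, x)
    return x + out  # residual connection (assumption)

def guided_attention(x, guide):
    # Guided-attention unit (assumed form): x attends to another modality.
    out, _ = attend(x, guide, guide)
    return x + out  # residual connection (assumption)

def mca_layer(question, image):
    # One co-attention layer: self-attention on each modality, then
    # question-guided attention on the image features.
    question = self_attention(question)
    image = self_attention(image)
    image = guided_attention(image, question)
    return question, image

def mcan_stack(question, image, depth=2):
    # Cascade the layers in depth; `depth` is a hypothetical setting.
    for _ in range(depth):
        question, image = mca_layer(question, image)
    return question, image

rng = np.random.default_rng(0)
question = rng.standard_normal((14, 64))  # hypothetical: 14 word features
image = rng.standard_normal((36, 64))     # hypothetical: 36 region features
q_out, v_out = mcan_stack(question, image, depth=2)
print(q_out.shape, v_out.shape)  # (14, 64) (36, 64)
```

Each unit preserves the shape of its input, which is what allows the layers to be cascaded to an arbitrary depth without any reshaping between them.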

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning