Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Duy-Kien Nguyen; Takayuki Okatani

arXiv:1804.00775·cs.CV·December 4, 2018

TL;DR

This paper introduces a symmetric dense co-attention mechanism for visual question answering, enhancing the fusion of visual and language features to improve accuracy and interpretability.

Contribution

The paper proposes a simple architecture, fully symmetric between visual and language representations, that can be stacked for multi-step interactions and achieves state-of-the-art results on the VQA and VQA 2.0 datasets.

Findings

01

Achieves a new state-of-the-art accuracy on the VQA and VQA 2.0 datasets.

02

Shows through qualitative evaluation that the attention mechanism generates reasonable attention maps.

03

Reaches this performance despite a small model size.

Abstract

A key solution to visual question answering (VQA) lies in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities helps boost the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on…
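The symmetric attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses no learned parameters, the affinity is a plain scaled dot product, and the residual sum used to stack layers is a stand-in for whatever fusion the model actually learns. The function names (`dense_coattention`, `stack_coattention`) and the scaling by the square root of the feature dimension are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine equal-length vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def dense_coattention(V, Q):
    """One symmetric co-attention step (hypothetical, parameter-free).

    V: list of image-region feature vectors.
    Q: list of question-word feature vectors (same dimension as V).
    """
    d = len(V[0])
    # affinity[n][t]: similarity between question word n and image region t.
    # Scaling by sqrt(d) is an assumption of this sketch.
    affinity = [[dot(q, v) / math.sqrt(d) for v in V] for q in Q]

    # Each question word attends to all image regions (normalize each row).
    word_over_regions = [softmax(row) for row in affinity]
    # Each image region attends to all question words (normalize each column).
    region_over_words = [
        softmax([affinity[n][t] for n in range(len(Q))]) for t in range(len(V))
    ]

    V_for_Q = [weighted_sum(w, V) for w in word_over_regions]  # one per word
    Q_for_V = [weighted_sum(w, Q) for w in region_over_words]  # one per region
    return V_for_Q, Q_for_V, word_over_regions, region_over_words


def stack_coattention(V, Q, num_layers=2):
    """Stack the step for multi-step interaction.

    The residual sum below is a placeholder for a learned fusion layer.
    """
    for _ in range(num_layers):
        V_for_Q, Q_for_V, _, _ = dense_coattention(V, Q)
        Q = [[a + b for a, b in zip(q, att)] for q, att in zip(Q, V_for_Q)]
        V = [[a + b for a, b in zip(v, att)] for v, att in zip(V, Q_for_V)]
    return V, Q
```

With three regions and two words, `dense_coattention` returns one attended visual vector per word and one attended question vector per region, so both modalities are updated from the same affinity matrix. That shared matrix, normalized once along each axis, is what makes the step symmetric.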

Code & Models

Repositories

cvlab-tohoku/Dense-CoAttention-Network
PyTorch
