Multimodal Residual Learning for Visual QA

Jin-Hwa Kim; Sang-Woo Lee; Dong-Hyun Kwak; Min-Oh Heo; Jeonghee Kim; Jung-Woo Ha; Byoung-Tak Zhang

arXiv:1606.01455 · cs.CV · September 1, 2016 · 209 citations

TL;DR

This paper introduces Multimodal Residual Networks (MRN), which extend deep residual learning to visual question-answering by combining vision and language information through element-wise multiplication. MRN achieves state-of-the-art results on the Visual QA dataset, and the paper also introduces a method for visualizing attention effects.

Contribution

The paper proposes MRN, a new multimodal residual learning framework that enhances joint representation learning from vision and language, extending residual learning to multimodal tasks.

Findings

01. Achieves state-of-the-art results on the Visual QA dataset for both the Open-Ended and Multiple-Choice tasks.

02. Learns joint representations from vision and language information.

03. Introduces a method for visualizing attention effects in multimodal learning.

Abstract

Deep neural networks continue to advance the state-of-the-art of image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning. Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention…
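The main idea in the abstract, a joint residual mapping built from element-wise multiplication, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tanh nonlinearities, the two-layer visual branch, the linear shortcut on the question vector, the reuse of the same visual vector in every block, and all names (`W_q`, `W_1`, `W_2`, `W_skip`, `mrn_block`, `mrn_forward`) are hypothetical choices made here for illustration and are not specified in the abstract.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def tanh_vec(x):
    """Apply tanh to every element of a vector."""
    return [math.tanh(xi) for xi in x]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def make_block(q_dim, v_dim, hidden_dim, rng):
    """Parameters of one block (all names are hypothetical)."""
    return {
        "W_q": random_matrix(hidden_dim, q_dim, rng),        # question branch
        "W_1": random_matrix(hidden_dim, v_dim, rng),        # visual branch, layer 1
        "W_2": random_matrix(hidden_dim, hidden_dim, rng),   # visual branch, layer 2
        "W_skip": random_matrix(hidden_dim, q_dim, rng),     # linear shortcut
    }


def mrn_block(params, q, v):
    """Return shortcut(q) + F(q, v), where F multiplies the two branches element-wise."""
    q_branch = tanh_vec(linear(params["W_q"], q))
    v_branch = tanh_vec(linear(params["W_2"], tanh_vec(linear(params["W_1"], v))))
    joint = [a * b for a, b in zip(q_branch, v_branch)]  # joint residual mapping
    shortcut = linear(params["W_skip"], q)
    return [s + f for s, f in zip(shortcut, joint)]


def mrn_forward(blocks, q, v):
    """Stack blocks: each block's output becomes the next block's question input."""
    h = q
    for params in blocks:
        h = mrn_block(params, h, v)
    return h


# Usage with made-up sizes: an 8-d question vector and a 16-d visual vector.
rng = random.Random(0)
blocks = [make_block(8, 16, 6, rng), make_block(6, 16, 6, rng)]
q = [rng.uniform(-1, 1) for _ in range(8)]
v = [rng.uniform(-1, 1) for _ in range(16)]
print(len(mrn_forward(blocks, q, v)))
```

In this sketch the visual input only enters through the multiplicative term, so a zero visual vector leaves just the shortcut path; that makes the residual role of the joint term easy to see.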

Code & Models

Repositories

jnhwkim/nips-mrn-vqa · Torch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques