Multimodal Residual Learning for Visual QA
Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang

TL;DR
This paper introduces Multimodal Residual Networks (MRN), a novel deep learning architecture that effectively combines vision and language for visual question-answering, achieving state-of-the-art results and providing visualizations of joint attention effects.
Contribution
The paper proposes MRN, a new multimodal residual learning framework that enhances joint representation learning from vision and language, extending residual learning to multimodal tasks.
Findings
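Achieves state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks.
Learns joint representations from vision and language using element-wise multiplication for the joint residual mappings.
Introduces a method to visualize attention effects in multimodal learning.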
Abstract
Deep neural networks continue to advance the state-of-the-art of image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning. Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention…
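The main idea in the abstract, element-wise multiplication for the joint residual mapping, can be illustrated with a small sketch. This is not the authors' implementation: the tanh nonlinearity, the identity shortcut on the question vector, and the names joint_residual_block, w_q, and w_v are assumptions made for illustration only.

```python
import math


def linear(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def tanh_vec(x):
    """Apply tanh to every element of a vector."""
    return [math.tanh(xi) for xi in x]


def joint_residual_block(q, v, w_q, w_v):
    """Return q + tanh(w_q q) * tanh(w_v v), with * taken element-wise.

    q is a question vector and v a visual feature vector. w_q and w_v are
    hypothetical projection matrices that map each input to len(q).
    The tanh and the identity shortcut are assumptions, not from the source.
    """
    q_proj = tanh_vec(linear(w_q, q))
    v_proj = tanh_vec(linear(w_v, v))
    # Joint residual mapping: element-wise product of the two projections.
    joint = [a * b for a, b in zip(q_proj, v_proj)]
    # Residual connection: add the joint mapping back onto the question vector.
    return [qi + ji for qi, ji in zip(q, joint)]


# Toy inputs with hand-picked weights so the result is easy to follow.
q = [0.5, -1.0]
v = [1.0, 0.0, 2.0]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
out = joint_residual_block(q, v, w_q, w_v)
```

In this toy setup the first output component equals the first question component, because the visual projection for that dimension is zero and so contributes nothing to the residual. Blocks of this form could in principle be stacked, feeding each output back in as the next q.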
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
