Learning to Reason: End-to-End Module Networks for Visual Question Answering

Ronghang Hu; Jacob Andreas; Marcus Rohrbach; Trevor Darrell; Kate Saenko

arXiv:1704.05526 · cs.CV · September 13, 2017 · 113 cites

TL;DR

This paper introduces End-to-End Module Networks (N2NMNs) that learn to generate question-specific reasoning networks directly from data, improving compositional visual question answering accuracy without relying on external parsers.

Contribution

N2NMNs learn to predict network layouts and parameters jointly, surpassing previous methods by eliminating parser dependency and discovering interpretable, question-specific architectures.

Findings

01 Nearly 50% error reduction on the CLEVR dataset

02 Discovered interpretable, question-specific network architectures

03 Outperformed state-of-the-art attentional approaches

Abstract

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without…
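The decomposition in the abstract's example ("is there an equal number of balls and boxes?") can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the modules are neural networks and the layout is predicted from the question, whereas here the modules are plain Python functions over a toy symbolic scene and the layout is written by hand. The names `Obj`, `SCENE`, `LAYOUT`, `MODULES`, and `run` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """A toy scene object; a real model would operate on image features."""
    shape: str


# Hypothetical scene standing in for an image: two balls, two boxes, one cylinder.
SCENE = [Obj("ball"), Obj("box"), Obj("ball"), Obj("box"), Obj("cylinder")]


def find(scene, shape):
    """Look for objects of the given shape."""
    return [o for o in scene if o.shape == shape]


def count(objs):
    """Count the objects selected by an earlier module."""
    return len(objs)


def equal(a, b):
    """Compare two counts."""
    return a == b


MODULES = {"find": find, "count": count, "equal": equal}

# Hand-written layout for the example question, as a nested expression.
# (Assumption: a learned model would predict a structure like this.)
LAYOUT = ("equal", ("count", ("find", "ball")), ("count", ("find", "box")))


def run(layout, scene):
    """Recursively assemble and execute the modules named in the layout."""
    name, *args = layout
    if name == "find":
        # Leaf module: its argument is a shape name, not a sub-layout.
        return find(scene, args[0])
    return MODULES[name](*(run(arg, scene) for arg in args))


answer = run(LAYOUT, SCENE)
print(answer)  # prints True: the scene has two balls and two boxes
```

Changing only `LAYOUT` yields a different question-specific computation over the same small set of modules, which is the compositional idea the abstract describes.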

Code & Models

Repositories

ronghanghu/n2nmn (TensorFlow)

Videos

Learning to Reason: End-to-End Module Networks for Visual Question Answering (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Average Pooling · Residual Connection · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling