Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

TL;DR
This paper introduces Neural-Image-QA, an end-to-end neural approach to visual question answering in which the answer is conditioned on both the image and the question. It doubles the performance of the previous best approach on this task.
Contribution
The paper presents an end-to-end formulation for multi-modal question answering in which all parts are trained jointly, proposes two new metrics for studying human consensus, and collects additional human answers for that analysis.
Findings
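Neural-Image-QA doubles the performance of the previous best approach on this problem
Analyzes how much information is contained in the language input alone, with a new human baseline for that setting
Proposes two new metrics for studying human consensus on ambiguous questions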
Abstract
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question). Our approach Neural-Image-QA doubles the performance of the previous best approach on this problem. We provide additional insights into the problem by analyzing how much information is contained only in the language part for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
