The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel

TL;DR
This paper introduces a scalable approach for visual question answering that leverages existing external algorithms and a novel co-attention model, achieving state-of-the-art results while providing human-readable explanations.
Contribution
The paper proposes a VQA method that exploits pre-existing image-operation algorithms through a new co-attention model. The model is trained end to end and produces explanations for its answers.
Findings
Achieves state-of-the-art results on the Visual Genome and VQA datasets.
Generates human-readable reasons for its answers.
Effectively exploits off-the-shelf algorithms for complex image operations.
Abstract
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations from detection and counting, to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image,question,answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method thus learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine. The core of our proposed method…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
