2nd Place Solution to the GQA Challenge 2019
Shijie Geng, Ji Zhang, Hang Zhang, Ahmed Elgammal, and Dimitris N. Metaxas

TL;DR
This paper introduces a simple method based on statistical features that significantly improves visual question answering performance, and uses it to argue that feature extraction, rather than reasoning, is the main bottleneck in complex visual reasoning tasks.
Contribution
The paper presents an approach that collects statistical features from high-frequency question words and feeds them to the reasoning model for visual question answering, highlighting the bottleneck in feature extraction.
Findings
Statistical features outperform detected features in reasoning tasks.
Using ground-truth features yields the best performance.
The method achieved 2nd place in the GQA Challenge 2019.
Abstract
We present a simple method that achieves unexpectedly superior performance for Visual Question Answering involving complex reasoning. Our solution collects statistical features from high-frequency words of all the questions asked about an image and uses them as accurate knowledge for answering further questions about the same image. We are fully aware that this setting is not ubiquitously applicable, and in a more common setting one should assume the questions are asked separately and cannot be gathered to obtain a knowledge base. Nonetheless, we use this method as evidence to demonstrate our observation that the bottleneck effect is more severe on the feature extraction part than it is on the knowledge reasoning part. We show significant gaps when using the same reasoning model with 1) ground-truth features; 2) statistical features; 3) detected features from completely learned…
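As a rough illustration of the idea described in the abstract, the sketch below gathers every question asked about an image, keeps the most frequent content words across the dataset, and turns each image's questions into a count vector over that vocabulary. The abstract does not specify how the authors build their statistical features, so everything here is an assumption made for illustration: the function names (`tokenize`, `build_vocab`, `image_features`), the small stopword list, the `top_k` cutoff, and the choice of raw counts.

```python
import re
from collections import Counter

# Hypothetical stopword list; the source does not say how common
# function words are handled.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "what", "which"}


def tokenize(question):
    """Lowercase a question and return its alphabetic content words."""
    words = re.findall(r"[a-z]+", question.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocab(questions_by_image, top_k):
    """Return the top_k most frequent content words across all questions.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for questions in questions_by_image.values():
        for question in questions:
            counts.update(tokenize(question))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


def image_features(questions, vocab):
    """Count how often each vocabulary word occurs in one image's questions."""
    counts = Counter()
    for question in questions:
        counts.update(tokenize(question))
    return [counts[word] for word in vocab]


# Toy usage with made-up questions (not taken from GQA).
questions_by_image = {
    "img1": ["Is the dog on the sofa?", "What color is the dog?"],
    "img2": ["Is the cat on the sofa?"],
}
vocab = build_vocab(questions_by_image, top_k=3)
features = {
    image_id: image_features(questions, vocab)
    for image_id, questions in questions_by_image.items()
}
```

In this sketch the resulting vectors would stand in for detected visual features as input to a reasoning model. As the abstract itself notes, this depends on having all questions about an image available at once, which does not hold in the usual setting where questions arrive separately.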
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
