An Analysis of Visual Question Answering Algorithms
Kushal Kafle, Christopher Kanan

TL;DR
This paper critically evaluates existing visual question answering algorithms using a new, extensive dataset and introduces improved evaluation methods to better understand their strengths and weaknesses.
Contribution
The paper presents a new large-scale dataset of over 1.6 million questions in 12 categories, including questions that are meaningless for a given image, along with new evaluation schemes that compensate for over-represented question types.
Findings
Attention mechanisms benefit some question categories more than others.
Simple models can outperform complex ones on easy questions.
Analysis reveals strengths and weaknesses of current VQA models.
Abstract
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art…
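The abstract's point about over-represented question types can be illustrated with a small sketch. It compares plain overall accuracy with an unweighted mean of per-type accuracies, which is one common way to keep a frequent, easy category from dominating the score. This is an assumption about the general idea, not the paper's exact metric; the record layout, the function names, and the two toy question types ("yes/no", "counting") are hypothetical.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of all questions answered correctly, regardless of type."""
    return sum(int(r["correct"]) for r in records) / len(records)


def mean_per_type_accuracy(records):
    """Unweighted arithmetic mean of the accuracy within each question type.

    Returns (mean, per_type) where per_type maps each type to its accuracy.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["qtype"]] += 1
        hits[r["qtype"]] += int(r["correct"])
    per_type = {t: hits[t] / totals[t] for t in totals}
    return sum(per_type.values()) / len(per_type), per_type


# Hypothetical toy results: a frequent, easy type and a rare, hard type.
records = (
    [{"qtype": "yes/no", "correct": True}] * 9
    + [{"qtype": "yes/no", "correct": False}] * 1
    + [{"qtype": "counting", "correct": True}] * 1
    + [{"qtype": "counting", "correct": False}] * 3
)

overall = overall_accuracy(records)
mean_type, per_type = mean_per_type_accuracy(records)
print(f"overall accuracy:       {overall:.3f}")
print(f"mean per-type accuracy: {mean_type:.3f}")
print(f"per-type breakdown:     {per_type}")
```

In this toy data the overall score is pulled up by the frequent type, while the per-type mean weights both types equally and exposes the weak category.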
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
