Answer Them All! Toward Universal Visual Question Answering Models
Robik Shrestha, Kushal Kafle, Christopher Kanan

TL;DR
This paper compares existing VQA models across natural and synthetic datasets, finds limited cross-domain generalization, and proposes a new model that performs well in both domains.
Contribution
It compares five state-of-the-art VQA algorithms on eight datasets under a standardized setup, and introduces a new VQA algorithm that rivals or exceeds the state of the art on both natural-image and synthetic datasets.
Findings
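Existing models do not generalize across the natural-image and synthetic-reasoning domains.
Standardizing the models (same visual features, answer vocabularies, etc.) makes the comparison fair.
The proposed model rivals or exceeds the state of the art in both domains.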
Abstract
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
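The standardized comparison described in the abstract can be pictured with a small harness. This is a minimal sketch, not the authors' code: the function names (`build_answer_vocab`, `accuracy`, `compare`), the `predict(features, question, vocab)` interface, the `top_k` cutoff, and the use of plain exact-match accuracy are all assumptions made for illustration. The point it shows is that every model sees the same precomputed features and the same answer vocabulary for each dataset, so differences in score come from the models rather than from their inputs.

```python
from collections import Counter


def build_answer_vocab(train_answers, top_k):
    """Keep the top_k most frequent training answers (hypothetical cutoff).

    The same vocabulary is then handed to every model being compared.
    """
    counts = Counter(train_answers)
    return [answer for answer, _ in counts.most_common(top_k)]


def accuracy(predict, examples, vocab):
    """Exact-match accuracy, counting only predictions inside the shared vocab.

    `examples` is a list of (features, question, answer) tuples, where
    `features` stands for the shared, precomputed visual features.
    """
    vocab_set = set(vocab)
    correct = 0
    for features, question, answer in examples:
        predicted = predict(features, question, vocab)
        if predicted in vocab_set and predicted == answer:
            correct += 1
    return correct / len(examples) if examples else 0.0


def compare(models, datasets, top_k=1000):
    """Return {model_name: {dataset_name: accuracy}} under one shared protocol.

    `models` maps a name to a predict function; `datasets` maps a name to
    (train_answers, test_examples). Both layouts are assumptions of this sketch.
    """
    results = {}
    for model_name, predict in models.items():
        results[model_name] = {}
        for dataset_name, (train_answers, test_examples) in datasets.items():
            vocab = build_answer_vocab(train_answers, top_k)
            results[model_name][dataset_name] = accuracy(
                predict, test_examples, vocab
            )
    return results
```

Calling `compare` with several predict functions and several datasets yields a model-by-dataset accuracy table, which is the shape of result needed to check whether a method that does well on natural-image datasets also does well on synthetic ones.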
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
