Answer Them All! Toward Universal Visual Question Answering Models

Robik Shrestha; Kushal Kafle; Christopher Kanan

arXiv:1903.00366 · cs.CV · April 8, 2019 · 6 cites

TL;DR

This paper compares five state-of-the-art VQA models across eight natural-image and synthetic datasets, finds that they do not generalize across the two domains, and proposes a new model that performs well in both.

Contribution

The paper introduces a new VQA algorithm that rivals or exceeds the state of the art on both natural-image and synthetic datasets, addressing the lack of universal models.

Findings

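01

Existing models do not generalize well between the natural-image and synthetic domains.

02

Standardizing components such as visual features and answer vocabularies enables a fair comparison of models.

03

The proposed model rivals or exceeds previous methods in both domains.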

Abstract

Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
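
The standardized comparison described in the abstract can be pictured with a minimal sketch. This is not the authors' code: the function names, the `(features, question, answer)` example format, and the `top_k` cutoff are assumptions made for illustration. The only idea taken from the abstract is that every model is scored against the same answer vocabulary and the same inputs on each dataset.

```python
from collections import Counter

def build_answer_vocab(answers, top_k):
    """Map the top_k most frequent answers to integer class indices."""
    counts = Counter(answers)
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common(top_k))}

def accuracy(predict, examples, vocab):
    """Fraction of examples whose predicted class index matches the gold answer.

    Gold answers missing from the vocabulary are counted as errors.
    """
    correct = 0
    for features, question, answer in examples:
        gold = vocab.get(answer)
        if gold is not None and predict(features, question) == gold:
            correct += 1
    return correct / len(examples) if examples else 0.0

def compare(models, datasets, top_k=1000):
    """Score every model on every dataset with one shared vocabulary per dataset.

    `models` maps a name to a hypothetical predict(features, question) callable.
    `datasets` maps a name to a (train_examples, test_examples) pair.
    """
    results = {}
    for ds_name, (train, test) in datasets.items():
        # The vocabulary is built once from training answers and reused for all models.
        vocab = build_answer_vocab([answer for _, _, answer in train], top_k)
        for model_name, predict in models.items():
            results[(model_name, ds_name)] = accuracy(predict, test, vocab)
    return results
```

Because the vocabulary is built once per dataset and shared, score differences between models cannot come from one model having a larger or differently ordered answer set.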

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning