Survey of Visual Question Answering: Datasets and Techniques

Akshay Kumar Gupta

arXiv:1705.03865 · cs.CL · May 12, 2017 · 22 cites

TL;DR

This survey reviews the datasets and models used in visual question answering, highlighting recent advances, comparing approaches, and suggesting future research directions in combining NLP and computer vision.

Contribution

It provides a comprehensive overview of VQA datasets and models, classifies approaches into four categories, and compares their performances.

Findings

01. Deep learning with attention outperforms other models.

02. Datasets vary significantly in size and complexity.

03. Future work should focus on improving model interpretability.

Abstract

Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of the survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the different approaches for VQA, classified into four types: non-deep learning models, deep learning models without attention, deep learning models with attention, and other models which do not fit into the first three. Finally, we compare the performances of these approaches and provide some directions for future work.

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning