Survey of Visual Question Answering: Datasets and Techniques
Akshay Kumar Gupta

TL;DR
This survey reviews the datasets and models used in visual question answering (VQA), a task that combines NLP and computer vision. It highlights recent advances, compares approaches, and suggests directions for future research.
Contribution
It provides a comprehensive overview of VQA datasets and models, classifies the approaches into four categories, and compares their performance.
Findings
Deep learning with attention outperforms other models.
Datasets vary significantly in size and complexity.
Future work should focus on improving model interpretability.
Abstract
Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of the survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the different approaches for VQA, classified into four types: non-deep learning models, deep learning models without attention, deep learning models with attention, and other models which do not fit into the first three. Finally, we compare the performances of these approaches and provide some directions for future work.
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
