Visual Question Answering: A Survey of Methods and Datasets
Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel

TL;DR
This survey reviews current methods and datasets in Visual Question Answering, highlighting approaches that combine visual and textual reasoning, and discusses future directions involving knowledge bases and NLP models.
Contribution
It provides a comprehensive classification of VQA methods and an in-depth review of datasets, emphasizing the integration of structured knowledge and reasoning capabilities.
Findings
Comparison of neural network-based approaches
Analysis of datasets including Visual Genome
Discussion on future research directions
Abstract
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA…
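The joint-embedding idea mentioned in the abstract (mapping image and question features into a common space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the random vectors stand in for CNN image features and RNN question encodings, the weight names (`W_img`, `W_qst`, `W_ans`), the dimensions, and the element-wise product used for fusion are hypothetical choices, and the survey covers several other ways to combine the two modalities.

```python
import math
import random


def linear(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def tanh_vec(vector):
    """Apply tanh element-wise."""
    return [math.tanh(x) for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def answer_distribution(image_feat, question_feat, params):
    """Project both modalities into a common space, fuse, and score answers."""
    img = tanh_vec(linear(params["W_img"], image_feat))     # image -> common space
    qst = tanh_vec(linear(params["W_qst"], question_feat))  # question -> common space
    joint = [a * b for a, b in zip(img, qst)]                # assumed fusion: element-wise product
    return softmax(linear(params["W_ans"], joint))           # distribution over candidate answers


# Hypothetical dimensions, chosen only to keep the example small.
IMG_DIM, QST_DIM, JOINT_DIM, NUM_ANSWERS = 8, 6, 4, 3

rng = random.Random(0)
params = {
    "W_img": random_matrix(JOINT_DIM, IMG_DIM, rng),
    "W_qst": random_matrix(JOINT_DIM, QST_DIM, rng),
    "W_ans": random_matrix(NUM_ANSWERS, JOINT_DIM, rng),
}

# Stand-ins for a CNN image descriptor and an RNN question encoding.
image_feat = [rng.random() for _ in range(IMG_DIM)]
question_feat = [rng.random() for _ in range(QST_DIM)]

probs = answer_distribution(image_feat, question_feat, params)
print(probs)
```

With untrained random weights the printed distribution is close to uniform; the sketch only shows how the two feature vectors meet in one space before answer classification.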
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
