TL;DR
This paper reviews Visual Question Answering (VQA), covering its problem formulation, datasets, evaluation metrics, and algorithms. It highlights the limitations of current datasets and discusses future research directions at the intersection of computer vision and NLP.
Contribution
It provides a comprehensive critique of existing VQA datasets and algorithms, and discusses future challenges and directions for the field.
Findings
Current datasets are limited in their ability to properly train and assess VQA algorithms.
Many algorithms have been proposed with varying effectiveness.
Future research should address dataset limitations and explore new algorithmic approaches.
Abstract
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
