From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam, Md Sakib Hossain Shovon, M.F. Mridha, Nilanjan Dey

TL;DR
This paper provides a comprehensive survey of Visual Question Answering (VQA), covering its evolution, datasets, methods, challenges, and future directions in multimodal AI, highlighting the shift from traditional to vision-language pre-training approaches.
Contribution
It offers a detailed taxonomy of VQA, analyzes historical and recent trends, and identifies open problems and future research opportunities in the field.
Findings
VQA datasets have expanded to include diverse visual inputs.
Vision-language pre-training has shifted the landscape of VQA methods.
Several open challenges remain in dataset diversity, model robustness, and multimodal reasoning.
Abstract
The multimodal task of Visual Question Answering (VQA) encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the early VQA approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges in the lens of VQA haven't been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
MethodsSparse Evolutionary Training
