The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering
Anupam Pandey, Deepjyoti Bodo, Arpan Phukan, Asif Ekbal

TL;DR
This survey reviews the evolution of Visual Question Answering (VQA) from its inception in 2015, highlighting key technological advances, challenges, and future directions in multimodal AI systems that interpret images and language.
Contribution
It provides a comprehensive overview of VQA's development, emphasizing the impact of transformer architectures, multimodal pre-training, and emerging trends in the field.
Findings
Transformers and pre-training significantly advanced VQA performance.
Major datasets and models have shaped the evolution of VQA.
Challenges include dataset bias, interpretability, and reasoning capabilities.
Abstract
Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling artificial intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days through major breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such…
Taxonomy
Methods: Softmax · Attention Is All You Need
