The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

Anupam Pandey; Deepjyoti Bodo; Arpan Phukan; Asif Ekbal

arXiv:2501.07109·cs.CV·January 14, 2025

TL;DR

This survey reviews the evolution of Visual Question Answering (VQA) from its inception in 2015, highlighting key technological advances, challenges, and future directions in multimodal AI systems that interpret images and language.

Contribution

It provides a comprehensive overview of VQA's development, emphasizing the impact of transformer architectures, multimodal pre-training, and emerging trends in the field.

Findings

01

Transformers and pre-training significantly advanced VQA performance.

02

Major datasets and models have shaped the evolution of VQA.

03

Challenges include dataset bias, interpretability, and reasoning capabilities.

Abstract

Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days through major breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such…

Taxonomy

Methods: Softmax · Attention Is All You Need