Visual question answering: from early developments to recent advances -- a survey
Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, Hakim Hacid

TL;DR
This survey reviews the evolution of Visual Question Answering (VQA), covering architectures, datasets, and applications, emphasizing recent advances like Large Visual Language Models and outlining future research challenges.
Contribution
It provides a comprehensive taxonomy of VQA architectures, reviews deep learning methods and emerging LVLMs, and discusses datasets, evaluation metrics, and future research directions.
Findings
Deep learning-based VQA approaches dominate the field.
Large Visual Language Models show promising results in VQA tasks.
The survey identifies key challenges and open questions for future research.
Abstract
Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need
