Visual question answering: from early developments to recent advances -- a survey

Ngoc Dung Huynh; Mohamed Reda Bouadjenek; Sunil Aryal; Imran Razzak; Hakim Hacid

arXiv:2501.03939 · cs.CV · January 14, 2025 · 2 cites

Open Access

TL;DR

This survey reviews the evolution of Visual Question Answering (VQA), covering architectures, datasets, and applications, emphasizing recent advances like Large Visual Language Models and outlining future research challenges.

Contribution

It provides a comprehensive taxonomy of VQA architectures, reviews deep learning methods and emerging LVLMs, and discusses datasets, evaluation metrics, and future research directions.

Findings

01. Deep learning-based VQA approaches dominate the field.

02. Large Visual Language Models show promising results on VQA tasks.

03. The survey identifies key challenges and open questions for future research.

Abstract

Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Softmax · travel james · Attention Is All You Need