From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities

Md Farhan Ishmam; Md Sakib Hossain Shovon; M.F. Mridha; Nilanjan Dey

arXiv:2311.00308·cs.CV·November 5, 2024·2 cites

Open Access

TL;DR

This paper provides a comprehensive survey of Visual Question Answering (VQA), covering its evolution, datasets, methods, challenges, and future directions in multimodal AI, highlighting the shift from traditional to vision-language pre-training approaches.

Contribution

It offers a detailed taxonomy of VQA, analyzes historical and recent trends, and identifies open problems and future research opportunities in the field.

Findings

01. VQA datasets have expanded to include diverse visual inputs.

02. Vision-language pre-training has shifted the landscape of VQA methods.

03. Several open challenges remain in dataset diversity, model robustness, and multimodal reasoning.

Abstract

The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted VQA from early approaches relying on feature extraction and fusion schemes to vision-language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges through the lens of VQA have not been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training