A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

Raihan Kabir; Naznin Haque; Md Saiful Islam; Marium-E-Jannat

arXiv:2411.11150 · cs.CV · November 19, 2024 · 2 cites

Open Access

TL;DR

This survey comprehensively reviews VQA datasets and models, categorizing datasets into four types and analyzing six main model paradigms, highlighting current challenges and future directions in visual question answering.

Contribution

It provides a detailed taxonomy of VQA datasets and models, summarizing methodologies and characteristics to guide future research in the field.

Findings

1. Identifies four categories of VQA datasets: authentic, synthetic, diagnostic, and knowledge-based.

2. Analyzes six main paradigms of VQA models: fusion, attention, external knowledge, reasoning, explanation, and graph models.

3. Discusses additional topics like scene text understanding, counting, and bias reduction.

Abstract

Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both visual understanding of the image and semantic understanding of the question, along with reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, dividing them into distinct categories and summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition