A survey on VQA_Datasets and Approaches

Yeyun Zou; Qiyu Xie

arXiv:2105.00421·cs.CV·May 4, 2021

TL;DR

This survey reviews existing VQA datasets, metrics, and models, highlighting recent advances in reasoning, scientific diagram understanding, and multimodal feature fusion techniques in the evolving field of visual question answering.

Contribution

It provides a comprehensive overview of current datasets, evaluation metrics, and models, emphasizing recent developments and challenges in VQA research.

Findings

01

Extensive review of VQA datasets and metrics

02

Analysis of recent models and their capabilities

03

Identification of challenges and future directions in VQA

Abstract

Visual question answering (VQA) is a task that combines techniques from computer vision and natural language processing. It requires a model to answer a text-based question using the information contained in a visual input. In recent years, the field has expanded: more work now examines the reasoning ability of VQA models and VQA on scientific diagrams, and more multimodal feature fusion mechanisms have been proposed. This paper reviews and analyzes existing datasets, metrics, and models proposed for the VQA task.
