
TL;DR
This survey reviews existing datasets, metrics, and models for visual question answering (VQA), highlighting recent work on reasoning, scientific diagram understanding, and multimodal feature fusion.
Contribution
The paper gives an overview of current VQA datasets, evaluation metrics, and models, with emphasis on recent developments and open challenges.
Findings
Extensive review of VQA datasets and metrics
Analysis of recent models and their capabilities
Identification of challenges and future directions in VQA
Abstract
Visual question answering (VQA) is a task that combines techniques from computer vision and natural language processing. It requires models to answer a text-based question using the information contained in a visual input. In recent years, the VQA research field has expanded. Work that examines the reasoning ability of VQA models, and work on VQA over scientific diagrams, has received growing attention. Meanwhile, more multimodal feature fusion mechanisms have been proposed. This paper reviews and analyzes existing datasets, metrics, and models proposed for the VQA task.
