VQA: Visual Question Answering
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

TL;DR
This paper introduces Visual Question Answering (VQA), a task that requires detailed image understanding and reasoning. It supports the task with a large dataset, baseline methods, and a setup that allows answers to be evaluated automatically.
Contribution
It defines the VQA task, provides a dataset of ~0.25M images, ~0.76M questions, and ~10M answers, and compares several baseline methods with human performance.
Findings
VQA requires complex reasoning beyond captioning.
The dataset enables automatic evaluation of open-ended questions.
Baseline methods show significant room for improvement.
Abstract
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org),…
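The abstract's point about automatic evaluation can be made concrete with a small scoring sketch. The rule below, which counts a predicted answer as fully correct once at least three human annotators gave the same answer, is the consensus accuracy commonly associated with the VQA benchmark; this page does not state it, so treat the formula as an assumption here. The function names and the normalization step (lowercasing and collapsing whitespace) are hypothetical simplifications, not the authors' evaluation code.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a simplified, hypothetical normalization)."""
    return " ".join(answer.lower().split())


def consensus_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score a short answer against several human answers.

    Assumed rule: accuracy = min(number of matching human answers / 3, 1).
    """
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(predicted)] / 3.0, 1.0)


# Illustrative (made-up) human answers for one question about one image.
human_answers = ["yes", "Yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes"]

print(consensus_accuracy("yes", human_answers))    # full credit: 8 matches
print(consensus_accuracy("no", human_answers))     # partial credit: 2 matches
print(consensus_accuracy("maybe", human_answers))  # no credit: 0 matches
```

Because most answers are only a few words long, exact matching after light normalization is usually enough, which is what makes open-ended answers practical to score without a human judge.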
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
