A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi

TL;DR
A-OKVQA is a new, challenging dataset for visual question answering. It emphasizes questions that require broad commonsense and world knowledge, with the aim of advancing AI reasoning beyond simple fact retrieval.
Contribution
The paper introduces A-OKVQA, a diverse crowdsourced dataset of about 25K questions that demand commonsense reasoning and world knowledge, addressing limitations of previous VQA datasets.
Findings
State-of-the-art models perform poorly on A-OKVQA
Questions require reasoning beyond simple image queries
Dataset promotes development of more intelligent VQA systems
Abstract
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at the correct answer. We introduce A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer. In contrast to the existing knowledge-based VQA datasets, the questions generally cannot be answered by simply querying a knowledge base, and instead require some form of commonsense reasoning about the scene…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Balanced Selection
