Measuring Machine Intelligence Through Visual Question Answering
C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, Devi Parikh

TL;DR
This paper argues that Visual Question Answering is a more promising task than image captioning for measuring machine intelligence, and describes a large dataset of human-generated questions and answers built for evaluating it.
Contribution
It describes a large-scale VQA dataset with over 760,000 human-generated questions about images and around 10 million human-generated answers, intended as a benchmark for a machine's ability to reason about language and vision.
Findings
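An ideal task for measuring machine intelligence should be easy for humans, difficult for machines, easy to evaluate, and not easily gameable.
Image captioning, examined as a case study, has limitations as a measure of machine intelligence.
VQA is presented as a more promising alternative, since machines can be evaluated easily against the collected human answers.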
Abstract
As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering that tests a machine's ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human generated questions about images. Using around 10 million human generated answers, machines may be easily evaluated.
