Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

TL;DR
This paper introduces a balanced VQA dataset with paired images per question to reduce language bias, benchmarks models on it, and develops an interpretable model providing counter-example explanations to improve trust.
Contribution
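The paper builds a more balanced VQA dataset by pairing each question with two similar images that yield different answers, benchmarks existing VQA models on it, and proposes an interpretable model that gives counter-example explanations alongside its answers.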
Findings
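State-of-the-art VQA models perform significantly worse on the balanced dataset, which suggests that they had learned to exploit language priors.
Accuracy on the original dataset therefore gave an inflated sense of how well models understand images.
The interpretable model provides a counter-example explanation alongside its answer, which can help users build trust in the system.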
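The effect of balancing on a model that ignores the image can be sketched with a toy example. The snippet below is a minimal illustration, not the authors' code or data: the `Example` type, the questions, and the image identifiers are all hypothetical, and accuracy is plain exact match rather than the consensus-based metric used for VQA. It fits a "blind" baseline that always returns the most frequent training answer for a question, then scores it on an unbalanced set and on a set built from complementary pairs.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Example(NamedTuple):
    question: str
    image_id: str  # hypothetical identifier; the blind baseline never reads it
    answer: str


def fit_question_prior(train):
    """Map each question to its most frequent training answer, ignoring the image."""
    counts = defaultdict(Counter)
    for ex in train:
        counts[ex.question][ex.answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


def accuracy(prior, examples):
    """Exact-match accuracy of the question-only baseline (a simplification)."""
    correct = sum(prior.get(ex.question) == ex.answer for ex in examples)
    return correct / len(examples)


# Toy unbalanced data: "yes" dominates, so the prior alone scores well.
unbalanced = [
    Example("Is there a clock?", "img_a", "yes"),
    Example("Is there a clock?", "img_b", "yes"),
    Example("Is there a clock?", "img_c", "yes"),
    Example("Is there a clock?", "img_d", "no"),
]

# Toy balanced data: each question has two similar images with different answers.
balanced = [
    Example("Is there a clock?", "img_a", "yes"),
    Example("Is there a clock?", "img_a_comp", "no"),
    Example("What color is the bus?", "img_e", "red"),
    Example("What color is the bus?", "img_e_comp", "blue"),
]

unbalanced_acc = accuracy(fit_question_prior(unbalanced), unbalanced)
balanced_acc = accuracy(fit_question_prior(balanced), balanced)
print(f"unbalanced: {unbalanced_acc:.2f}, balanced: {balanced_acc:.2f}")
```

Because the baseline returns the same answer for both images in a pair, it can be right on at most one of them, so in this toy setup its accuracy falls from 0.75 to 0.50. Doing better on such pairs requires actually looking at the image.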
Abstract
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has…
