TL;DR
This paper critiques current VQA evaluation methods, introduces the GQA-OOD benchmark to better assess reasoning, and demonstrates that existing models struggle with infrequent concepts, highlighting the need for improved approaches.
Contribution
The paper proposes the GQA-OOD benchmark to evaluate VQA models on rare and frequent questions, emphasizing reasoning over dataset bias exploitation.
Findings
Models perform poorly on infrequent concepts.
Overall in-domain accuracy is a misleading measure of reasoning, because it favors models that exploit training-set statistics.
Bias reduction techniques have limited success on rare questions.
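To make the rare-versus-frequent split concrete, here is a minimal sketch of how accuracy could be reported separately for frequent ("head") and rare ("tail") answers. Everything in it is an illustrative assumption rather than the paper's implementation: the function names `split_head_tail` and `accuracy_by_split`, the `group` and `answer` fields, the rule that an answer counts as rare when its count within a question group is at most `alpha` times the group's mean answer count, and the toy data.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, alpha=1.2):
    """Label each sample 'head' or 'tail' by how common its answer is
    within its question group (hypothetical rule, see lead-in)."""
    counts = defaultdict(Counter)
    for s in samples:
        counts[s["group"]][s["answer"]] += 1

    labels = []
    for s in samples:
        group_counts = counts[s["group"]]
        # Mean number of samples per distinct answer in this group.
        mean_count = sum(group_counts.values()) / len(group_counts)
        is_tail = group_counts[s["answer"]] <= alpha * mean_count
        labels.append("tail" if is_tail else "head")
    return labels


def accuracy_by_split(samples, predictions, alpha=1.2):
    """Return accuracy over all samples and over the head/tail subsets."""
    labels = split_head_tail(samples, alpha)
    hits = defaultdict(list)
    for s, pred, label in zip(samples, predictions, labels):
        correct = pred == s["answer"]
        hits[label].append(correct)
        hits["all"].append(correct)
    return {name: sum(v) / len(v) for name, v in hits.items()}


# Toy data: one question group where "white" dominates the answers.
samples = (
    [{"group": "color", "answer": "white"}] * 8
    + [{"group": "color", "answer": "purple"}]
    + [{"group": "color", "answer": "green"}]
)
# A "model" that ignores the image and always guesses the majority answer.
predictions = ["white"] * len(samples)

scores = accuracy_by_split(samples, predictions)
print(scores)
```

On this toy data the majority-guessing model reaches 0.8 overall accuracy while scoring 0.0 on the rare answers, which is the kind of gap a single overall accuracy number hides.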
Abstract
Models for Visual Question Answering (VQA) are notorious for their tendency to rely on dataset biases, as the large and unbalanced diversity of questions and concepts involved tends to prevent models from learning to reason, leading them to perform educated guesses instead. In this paper, we claim that the standard evaluation metric, which consists in measuring the overall in-domain accuracy, is misleading. Since questions and concepts are unbalanced, this tends to favor models which exploit subtle training set statistics. Alternatively, naively introducing artificial distribution shifts between train and test splits is also not completely satisfying. First, the shifts do not reflect real-world tendencies, resulting in unsuitable models; second, since the shifts are handcrafted, trained models are specifically designed for this particular setting, and do not generalize to other…
