TL;DR
This paper develops an evaluation method for visual question answering (VQA) that detects multimodal shortcut learning involving both questions and images. It shows that current models often rely on such shortcuts and perform poorly under this evaluation.
Contribution
The paper introduces VQA-CounterExamples, an evaluation protocol for identifying multimodal shortcuts in VQA datasets, and shows that existing bias mitigation techniques are largely ineffective against them.
Findings
State-of-the-art models perform poorly when evaluated for reliance on multimodal shortcuts.
Existing bias reduction methods are largely ineffective against multimodal shortcuts.
Prior work's focus on question-based biases overlooks shortcuts that involve both questions and images.
Abstract
We introduce an evaluation methodology for visual question answering (VQA) to better diagnose cases of shortcut learning. These cases happen when a model exploits spurious statistical regularities to produce correct answers but does not actually deploy the desired behavior. There is a need to identify possible shortcuts in a dataset and assess their use before deploying a model in the real world. The research community in VQA has focused exclusively on question-based shortcuts, where a model might, for example, answer "What is the color of the sky" with "blue" by relying mostly on the question-conditional training prior and give little weight to visual evidence. We go a step further and consider multimodal shortcuts that involve both questions and images. We first identify potential shortcuts in the popular VQA v2 training set by mining trivial predictive rules such as co-occurrences of…
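The question-based shortcut described in the abstract can be illustrated with a question-only baseline that never looks at the image. The following is a minimal sketch, not the authors' method: the function names `fit_question_prior` and `prior_accuracy`, the exact-match question normalization, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_pairs):
    """Map each normalized question to its most frequent training answer.

    train_pairs: iterable of (question, answer) strings.
    Hypothetical helper; normalization is a simple lowercase + strip.
    """
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question.lower().strip()][answer] += 1
    # Keep only the top answer per question: the question-conditional prior.
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


def prior_accuracy(prior, eval_pairs):
    """Accuracy obtained by answering from the question alone."""
    eval_pairs = list(eval_pairs)
    if not eval_pairs:
        return 0.0
    hits = sum(prior.get(q.lower().strip()) == a for q, a in eval_pairs)
    return hits / len(eval_pairs)


# Toy data (invented for illustration, not taken from VQA v2).
train = [
    ("What is the color of the sky", "blue"),
    ("What is the color of the sky", "blue"),
    ("What is the color of the sky", "gray"),
]
test_pairs = [
    ("What is the color of the sky", "blue"),    # the prior happens to be right
    ("What is the color of the sky", "orange"),  # the image contradicts the prior
]

prior = fit_question_prior(train)
print(prior["what is the color of the sky"])
print(prior_accuracy(prior, test_pairs))
```

If a trained model scores close to this baseline on some subset of questions, that is a hint (not proof) that it gives little weight to visual evidence there. The multimodal shortcuts studied in the paper go beyond this question-only case.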
