Visual Grounding Methods for VQA are Working for the Wrong Reasons!
Robik Shrestha, Kushal Kafle, Christopher Kanan

TL;DR
This paper reveals that current visual grounding techniques in VQA improve performance mainly through regularization rather than true visual grounding, and proposes a simple, annotation-free regularization method that achieves near state-of-the-art results.
Contribution
The paper demonstrates that visual grounding methods' performance gains are due to regularization effects and introduces a straightforward, annotation-free regularization approach for VQA.
Findings
Random cues yield performance improvements similar to those from human attention maps.
Proposed regularization method achieves near state-of-the-art results on VQA-CPv2.
The performance gains of visual grounding methods come from a regularization effect that prevents over-fitting to linguistic priors, not from improved grounding.
Abstract
Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
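The random-cue control mentioned in the abstract can be illustrated with a minimal sketch. It assumes each visual cue is a list of per-region importance scores, which the abstract does not specify, and the function names `random_cues` and `swap_in_random_cues` are hypothetical, not taken from the authors' code. The idea is only to show what replacing human attention maps with random maps of the same shape could look like.

```python
import random


def random_cues(num_regions, seed=None):
    """Return one random importance score per image region, normalized to sum to 1.

    Assumption: a cue is a flat list of per-region scores.
    """
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(num_regions)]
    total = sum(scores)
    return [s / total for s in scores]


def swap_in_random_cues(human_cues, seed=0):
    """Replace each human attention map with a random map of the same length.

    The human scores are ignored on purpose: only the shape is kept, so any
    gain from training with these cues cannot come from human grounding.
    """
    rng = random.Random(seed)
    return [random_cues(len(m), seed=rng.randrange(2**32)) for m in human_cues]


if __name__ == "__main__":
    # Two toy "human" attention maps over 3 and 2 image regions.
    human = [[0.7, 0.2, 0.1], [0.5, 0.5]]
    for cue in swap_in_random_cues(human, seed=0):
        print([round(s, 3) for s in cue])
```

If a model trained with such random maps improves about as much as one trained with human maps, that is consistent with the abstract's claim that the benefit is a regularization effect rather than better grounding.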
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
