Visual Grounding Methods for VQA are Working for the Wrong Reasons!

Robik Shrestha; Kushal Kafle; Christopher Kanan

arXiv:2004.05704 · cs.CV · April 24, 2024 · 5 cites

TL;DR

This paper reveals that current visual grounding techniques in VQA improve performance mainly through regularization rather than true visual grounding, and proposes a simple, annotation-free regularization method that achieves near state-of-the-art results.

Contribution

The paper demonstrates that visual grounding methods' performance gains are due to regularization effects and introduces a straightforward, annotation-free regularization approach for VQA.

Findings

01. Random cues yield performance improvements similar to those obtained with human attention maps.

02. The proposed annotation-free regularization method achieves near state-of-the-art results on VQA-CPv2.

03. The gains reported for visual grounding methods appear to come from a regularization effect rather than from improved visual grounding.

Abstract

Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
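The abstract does not spell out the proposed regularization scheme, so the following is only a minimal sketch of one annotation-free form such a regularizer could take. It assumes the model outputs per-answer probabilities trained with binary cross-entropy, and it adds a second term that pulls every answer score toward zero. The names `bce`, `regularized_loss`, and the weight `lam` are hypothetical and are not taken from the paper or its repository.

```python
import math

def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

def regularized_loss(probs, targets, lam=1.0):
    """Task loss plus an assumed penalty that pulls all answer scores toward zero.

    The penalty needs no attention maps or other external annotations:
    it only uses the model's own predictions and an all-zero target.
    """
    zeros = [0.0] * len(probs)
    return bce(probs, targets) + lam * bce(probs, zeros)

# Toy prediction over three candidate answers (illustrative values only).
probs = [0.9, 0.2, 0.1]
targets = [1.0, 0.0, 0.0]

plain = bce(probs, targets)
reg = regularized_loss(probs, targets, lam=0.5)
print(f"plain loss: {plain:.4f}, regularized loss: {reg:.4f}")
```

Because the extra term depends only on the predictions, it can be applied to any training sample without human cues, which is the property the abstract emphasizes. The actual loss, weighting, and sampling used by the authors should be checked against the paper and the linked repository.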

Code & Models

Repositories

erobic/negative_analysis_of_grounding
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning