Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA
Badri N. Patro, Anupriy, Vinay P. Namboodiri

TL;DR
This paper introduces an adversarial training approach that improves attention in visual question answering (VQA) by bringing the distribution of attention maps closer to that of Grad-CAM visual explanations, resulting in more human-like attention and better task performance.
Contribution
It proposes a novel adversarial framework that uses visual explanations as supervision to enhance attention maps in VQA models, outperforming other distribution alignment methods.
Findings
Attention maps become more aligned with human attention.
Significant improvement in VQA accuracy and in rank correlation with human attention.
Adversarial loss outperforms other distribution matching losses.
Abstract
In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of…
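The two-player game described in the abstract can be illustrated with the standard GAN objective. The snippet below is a minimal sketch and not the authors' implementation. It assumes a hypothetical linear discriminator over flattened 2x2 maps, hand-picked toy values for the maps and weights, and the common non-saturating form of the attention-side loss; none of these details come from the paper. The discriminator is rewarded for scoring Grad-CAM maps high and attention maps low, while the attention module is rewarded when its map is scored as an explanation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def normalize(m):
    """Scale a flattened non-negative map so its entries sum to 1."""
    total = sum(m)
    return [v / total for v in m]

def discriminator(m, weights, bias):
    """Hypothetical linear discriminator (an assumption for illustration).
    Returns the estimated probability that map m is a Grad-CAM explanation."""
    return sigmoid(sum(w * v for w, v in zip(weights, m)) + bias)

def discriminator_loss(d_explanation, d_attention, eps=1e-12):
    """Standard GAN discriminator loss: explanations are 'real',
    attention maps are 'fake'."""
    return -(math.log(d_explanation + eps) + math.log(1.0 - d_attention + eps))

def attention_loss(d_attention, eps=1e-12):
    """Non-saturating loss for the attention module (assumed form):
    small when the discriminator takes the attention map for an explanation."""
    return -math.log(d_attention + eps)

# Toy flattened 2x2 maps (made-up values, not data from the paper).
grad_cam = normalize([0.1, 0.7, 0.1, 0.1])
attention = normalize([0.4, 0.2, 0.2, 0.2])

# Made-up discriminator parameters that favour mass on the second cell.
weights, bias = [0.0, 4.0, 0.0, 0.0], -1.5

d_exp = discriminator(grad_cam, weights, bias)
d_att = discriminator(attention, weights, bias)

print("D(explanation):", round(d_exp, 3))
print("D(attention):  ", round(d_att, 3))
print("discriminator loss:", round(discriminator_loss(d_exp, d_att), 3))
print("attention loss:    ", round(attention_loss(d_att), 3))
```

In a full model the two losses would be minimized in alternation, and the attention-side term would be added to the usual answer-classification loss, so that attention maps drift toward the explanation distribution without being regressed onto Grad-CAM directly.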
