Exploring Human-like Attention Supervision in Visual Question Answering
Tingting Qiao, Jianfeng Dong, Duanqing Xu

TL;DR
This paper introduces a Human Attention Network (HAN), trained on the VQA-HAT dataset, that generates human-like attention maps for the image-question pairs in VQA v2.0, and shows that supervising VQA models with these maps improves both accuracy and attention quality.
Contribution
The work proposes a Human Attention Network (HAN), trained on the recently released VQA-HAT dataset, to generate human-like attention maps for VQA v2.0, and uses these maps as explicit attention supervision for VQA models.
Findings
Human-like attention supervision improves VQA accuracy.
Generated attention maps align better with human attention.
Supervision leads to more accurate and interpretable attention maps.
Abstract
Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to selectively target different areas of an image, which suggests that an attention-based model may benefit from an explicit attention supervision. In this work, we aim to address the problem of adding attention supervision to VQA models. Since there is a lack of human attention data, we first propose a Human Attention Network (HAN) to generate human-like attention maps, training on a recently released dataset called Human ATtention Dataset (VQA-HAT). Then, we apply the pre-trained HAN on the VQA v2.0 dataset to automatically produce the human-like attention maps for all image-question pairs. The generated human-like attention map dataset for the VQA v2.0 dataset…
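The idea of explicit attention supervision described in the abstract can be illustrated with a minimal sketch. It assumes the model produces one raw attention score per image region and that a human-like attention map of the same length is available for the same image-question pair. The mean squared error form of the loss, the function names, and the `weight` parameter are all assumptions made for illustration; the abstract does not specify how the authors define the supervision term.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into a distribution over image regions."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(weights):
    """Rescale a non-negative human-like attention map so it sums to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [w / total for w in weights]


def attention_supervision_loss(model_scores, human_like_map):
    """Mean squared error between model attention and the human-like map.

    The MSE form is an assumption for this sketch, not the paper's stated loss.
    """
    if len(model_scores) != len(human_like_map):
        raise ValueError("attention vectors must cover the same regions")
    predicted = softmax(model_scores)
    target = normalize(human_like_map)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(answer_loss, model_scores, human_like_map, weight=1.0):
    """Answer loss plus a weighted attention term; `weight` is hypothetical."""
    return answer_loss + weight * attention_supervision_loss(
        model_scores, human_like_map
    )
```

In a setup like this, the attention term is zero when the model already attends to regions in the same proportions as the human-like map, and grows as the two distributions diverge, so the answer loss and the attention loss can be optimized jointly.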
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
