Answer Questions with Right Image Regions: A Visual Attention Regularization Approach
Yibing Liu, Yangyang Guo, Jianhua Yin, Xuemeng Song, Weifeng Liu, Liqiang Nie

TL;DR
This paper introduces AttReg, a flexible visual attention regularization method for VQA that improves visual grounding without requiring human attention data. It reports 60.00% accuracy on VQA-CP v2, a new state of the art.
Contribution
AttReg regularizes visual attention in VQA models by directing it toward key image regions the model has ignored, without human supervision, and improves accuracy across three benchmark datasets.
Findings
Achieved 60.00% accuracy on VQA-CP v2, a new state-of-the-art.
AttReg improves visual grounding and reasoning in VQA models.
Effective across three benchmark datasets.
Abstract
Visual attention in Visual Question Answering (VQA) aims to locate the right image regions for answer prediction, offering a powerful technique to promote multi-modal understanding. However, recent studies have pointed out that the image regions highlighted by visual attention are often irrelevant to the given question and answer, which confuses the model and hinders correct visual reasoning. To tackle this problem, existing methods mostly resort to aligning the visual attention weights with human attention. Nevertheless, gathering such human data is laborious and expensive, making it burdensome to adapt well-developed models across datasets. To address this issue, in this paper, we devise a novel visual attention regularization approach, namely AttReg, for better visual grounding in VQA. Specifically, AttReg first identifies the image regions which are essential for…
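To make the notion of "visual attention weights" concrete, here is a minimal sketch of generic question-guided soft attention over image regions. It is not AttReg and not the authors' model: the dot-product scoring, the function name question_guided_attention, the 36 regions, and the 8-dimensional random features are all assumptions chosen for illustration.

```python
import math
import random


def question_guided_attention(region_feats, question_vec):
    """Generic soft attention over image regions (illustrative, not AttReg).

    region_feats: list of per-region feature vectors (hypothetical inputs).
    question_vec: a question embedding of the same dimensionality.
    Returns the attention weights and the attention-weighted image feature.
    """
    # Relevance score of each region: dot product with the question vector.
    scores = [sum(r * q for r, q in zip(region, question_vec))
              for region in region_feats]
    # Numerically stable softmax turns the scores into attention weights.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended feature: weighted sum of the region features.
    dim = len(question_vec)
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended


# Toy data standing in for detector region features and a question embedding.
random.seed(0)
NUM_REGIONS, DIM = 36, 8
region_feats = [[random.gauss(0, 1) for _ in range(DIM)]
                for _ in range(NUM_REGIONS)]
question_vec = [random.gauss(0, 1) for _ in range(DIM)]

weights, attended = question_guided_attention(region_feats, question_vec)
# The "highlighted" regions are those with the largest attention weights.
top_regions = sorted(range(NUM_REGIONS), key=lambda i: weights[i],
                     reverse=True)[:3]
```

The regions in top_regions are the ones such a model would treat as most relevant. The problem the abstract describes is that these highlighted regions are often not the ones the question and answer actually depend on.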
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
