Sentence Attention Blocks for Answer Grounding
Seyedalireza Khoshsirat, Chandra Kambhamettu

TL;DR
This paper introduces the Sentence Attention Block, a novel and flexible architectural component that improves answer grounding in visual question answering by explicitly modeling dependencies between image features and sentence embeddings, achieving state-of-the-art results.
Contribution
The paper proposes the Sentence Attention Block, which enhances answer grounding by re-calibrating image features based on sentence context. The block is compatible with pre-trained networks and easy to implement.
Findings
Achieved state-of-the-art accuracy on multiple datasets.
Demonstrated the block's effectiveness through ablation studies.
Integrates flexibly with various backbone networks.
Abstract
Answer grounding is the task of locating relevant visual evidence for the Visual Question Answering task. While a wide variety of attention methods have been introduced for this task, they suffer from the following three problems: designs that do not allow the usage of pre-trained networks and do not benefit from large data pre-training; custom designs that are not based on well-grounded previous designs, therefore limiting the learning power of the network; or complicated designs that make it challenging to re-implement or improve them. In this paper, we propose a novel architectural block, which we term Sentence Attention Block, to solve these problems. The proposed block re-calibrates channel-wise image feature-maps by explicitly modeling inter-dependencies between the image feature-maps and sentence embedding. We visually demonstrate how this block filters out irrelevant…
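The channel-wise re-calibration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the block's internals, so the code below assumes a squeeze-and-excitation-style design in which globally pooled image features are concatenated with the sentence embedding and passed through a small two-layer gating network. The function name `sentence_attention_gate`, the weight names, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def sentence_attention_gate(features, sentence, w1, b1, w2, b2):
    """Re-scale each image channel by a gate computed from image and sentence.

    ASSUMPTION: an SE-style design; the paper's actual block may differ.
    features: (C, H, W) image feature maps
    sentence: (D,) sentence embedding
    w1: (R, C + D), b1: (R,), w2: (C, R), b2: (C,) hypothetical gating weights
    """
    # Squeeze: summarize each channel with its spatial mean.
    pooled = features.mean(axis=(1, 2))                # (C,)
    # Combine the image summary with the sentence embedding.
    joint = np.concatenate([pooled, sentence])         # (C + D,)
    # Two-layer gating network: ReLU bottleneck, then sigmoid.
    hidden = np.maximum(0.0, w1 @ joint + b1)          # (R,)
    gates = sigmoid(w2 @ hidden + b2)                  # (C,), each in (0, 1)
    # Excite: broadcast one gate per channel over H and W.
    return features * gates[:, None, None], gates


# Toy usage with random inputs and weights (illustrative sizes only).
rng = np.random.default_rng(0)
C, H, W, D, R = 8, 4, 4, 16, 4
features = rng.standard_normal((C, H, W))
sentence = rng.standard_normal(D)
w1 = rng.standard_normal((R, C + D)) * 0.1
b1 = np.zeros(R)
w2 = rng.standard_normal((C, R)) * 0.1
b2 = np.zeros(C)

out, gates = sentence_attention_gate(features, sentence, w1, b1, w2, b2)
print(out.shape, gates.shape)
```

Because the gate only rescales existing channels, a block of this kind leaves the tensor shape unchanged, so it can in principle sit between layers of a pre-trained backbone; that property is the reason the sketch keeps the output shape identical to the input.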
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
