TL;DR
AttnGrounder is an end-to-end model that improves visual grounding by using a visual-text attention module to relate words to image regions and generate attention masks, leading to better localization.
Contribution
It introduces a novel attention-based approach that relates each word to image regions and uses auxiliary attention masks for enhanced localization in visual grounding.
Findings
Reports a 3.26% improvement on the Talk2Car dataset over existing work
Uses a visual-text attention module that relates each query word to every image region
Trains an auxiliary attention mask around the referred object to improve localization
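To make the word-to-region idea concrete, here is a minimal NumPy sketch of how a region-dependent text representation could be built. It is an illustration under stated assumptions, not the paper's implementation: the scaled dot-product score, the softmax over words, the feature sizes, and the name `region_dependent_text` are all hypothetical choices, since this summary does not specify the scoring function.

```python
import numpy as np

def region_dependent_text(words, regions):
    """Build one text vector per image region by attending over query words.

    words:   (T, D) array, one feature vector per query word.
    regions: (R, D) array, one feature vector per image region.
    Returns (text_per_region, weights) with shapes (R, D) and (R, T).
    """
    # Assumption: scaled dot-product similarity between regions and words.
    scores = regions @ words.T / np.sqrt(words.shape[1])   # (R, T)
    # Numerically stable softmax over the word axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each region receives its own weighted mix of word features.
    text_per_region = weights @ words                      # (R, D)
    return text_per_region, weights

# Hypothetical sizes: a 5-word query and a 7x7 grid of regions.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))
regions = rng.normal(size=(49, 16))
text_per_region, weights = region_dependent_text(words, regions)
```

Because the weights are computed per region, two regions generally receive different text vectors, in contrast to using the same text representation for every region.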
Abstract
We propose Attention Grounder (AttnGrounder), a single-stage end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image for constructing a region-dependent text representation. Furthermore, for improving the localization ability of our model, we use our visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated with the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing…
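The auxiliary mask target described in the abstract can be sketched as follows. This is a hedged illustration only: the grid resolution, the pixel-to-cell rounding, the binary cross-entropy loss, and the helper names `rectangular_mask` and `mask_bce` are assumptions made for the example; the abstract states only that a rectangular mask is generated from the ground-truth coordinates.

```python
import numpy as np

def rectangular_mask(box, grid_h, grid_w, img_h, img_w):
    """Rasterize a ground-truth box (x1, y1, x2, y2), in pixels, onto a grid.

    Returns a (grid_h, grid_w) float array: 1 inside the box, 0 elsewhere.
    """
    x1, y1, x2, y2 = box
    # Assumption: a cell counts as "inside" if the box overlaps it at all.
    gx1 = int(np.floor(x1 / img_w * grid_w))
    gy1 = int(np.floor(y1 / img_h * grid_h))
    gx2 = int(np.ceil(x2 / img_w * grid_w))
    gy2 = int(np.ceil(y2 / img_h * grid_h))
    mask = np.zeros((grid_h, grid_w), dtype=np.float32)
    mask[gy1:gy2, gx1:gx2] = 1.0
    return mask

def mask_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted mask and the target.

    Assumption: BCE is used here for illustration; the source does not
    name the auxiliary loss.
    """
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(loss.mean())

# Hypothetical example: a 256x256 image and an 8x8 attention grid.
target = rectangular_mask((64, 32, 192, 96), grid_h=8, grid_w=8,
                          img_h=256, img_w=256)
perfect_loss = mask_bce(target, target)        # prediction matches target
inverted_loss = mask_bce(1.0 - target, target) # prediction is exactly wrong
```

In a training loop, a term like `mask_bce` would be added to the main grounding loss so that the attention map is pushed toward the referred object's box.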
