AttnGrounder: Talking to Cars with Attention

Vivek Mittal

arXiv:2009.05684 · cs.CV · December 14, 2020

TL;DR

AttnGrounder is a single-stage, end-to-end trainable model for visual grounding. Its visual-text attention module relates each query word to each image region and also produces an attention mask around the referred object, which improves localization.

Contribution

It introduces a novel attention-based approach that relates each word to image regions and uses auxiliary attention masks for enhanced localization in visual grounding.

Findings

01 Reports a 3.26% improvement on the Talk2Car dataset

02 Uses a visual-text attention module to relate each query word to each image region

03 Trains an attention mask as an auxiliary task to improve localization
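
The auxiliary mask in finding 03 can be illustrated with a minimal sketch. The abstract only says the target is a rectangular mask built from the ground-truth coordinates; everything else here is an assumption made for illustration: the `(x1, y1, x2, y2)` pixel box format, the grid resolution, the rule that any cell touched by the box is set to 1, the binary cross-entropy loss, and the hypothetical function names `rectangular_mask` and `mask_bce`.

```python
import math

def rectangular_mask(box, image_size, grid_size):
    """Build a grid_h x grid_w mask of 0.0/1.0 from a ground-truth box.

    box:        (x1, y1, x2, y2) in pixels (assumed format)
    image_size: (width, height) in pixels
    grid_size:  (grid_w, grid_h), the resolution of the attention map
    """
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    grid_w, grid_h = grid_size
    cell_w, cell_h = img_w / grid_w, img_h / grid_h
    # Range of grid cells that the box overlaps, clamped to the grid.
    c0 = max(0, int(x1 // cell_w))
    c1 = min(grid_w - 1, math.ceil(x2 / cell_w) - 1)
    r0 = max(0, int(y1 // cell_h))
    r1 = min(grid_h - 1, math.ceil(y2 / cell_h) - 1)
    return [
        [1.0 if r0 <= r <= r1 and c0 <= c <= c1 else 0.0 for c in range(grid_w)]
        for r in range(grid_h)
    ]

def mask_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted mask (values in [0, 1])
    and a target mask. The loss choice is an assumption, not from the paper."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

# Hypothetical example: a 32x32 image, a 4x4 attention grid, one box.
target = rectangular_mask((8, 8, 24, 16), (32, 32), (4, 4))
uniform = [[0.5] * 4 for _ in range(4)]
loss = mask_bce(uniform, target)
```

A predicted mask that matches the rectangle drives this loss toward zero, which is the sense in which the auxiliary task could push attention onto the referred object.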

Abstract

We propose Attention Grounder (AttnGrounder), a single-stage end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image to construct a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use our visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated with the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing…
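
The idea of a region-dependent text representation can be sketched as follows. This is not the authors' implementation: the abstract does not specify the scoring function or normalization, so the dot-product score, the softmax over words, and the hypothetical names `softmax` and `region_dependent_text` are all assumptions chosen to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_dependent_text(regions, words):
    """Attend over the query words separately for each image region.

    regions: list of R region feature vectors, each of length D
    words:   list of T word feature vectors, each of length D
    Returns (weights, texts), where weights[r][t] is the attention of
    region r on word t and texts[r] is the weighted sum of word vectors.
    """
    dim = len(words[0])
    weights, texts = [], []
    for region in regions:
        # Assumed scoring function: dot product of region and word features.
        scores = [sum(a * b for a, b in zip(region, word)) for word in words]
        w = softmax(scores)
        weights.append(w)
        # Each region gets its own text vector instead of one shared vector.
        texts.append(
            [sum(w[t] * words[t][d] for t in range(len(words))) for d in range(dim)]
        )
    return weights, texts

# Toy features: two regions and two words in a 2-dimensional space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[10.0, 0.0], [0.0, 10.0]]
weights, texts = region_dependent_text(regions, words)
```

In this toy input each region attends almost entirely to the word aligned with it, so the two regions end up with different text vectors, in contrast to methods that share one text representation across all regions.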

Code & Models

Repositories

i-m-vivek/AttnGrounder
PyTorch · Official
