Grounding of Textual Phrases in Images by Reconstruction

Anna Rohrbach; Marcus Rohrbach; Ronghang Hu; Trevor Darrell; Bernt Schiele

arXiv:1511.03745·cs.CV·February 21, 2017

TL;DR

This paper introduces a method for localizing (grounding) textual phrases in images. The model learns to attend to the image region that lets it reconstruct the input phrase, so it can be trained with no or little grounding supervision. The authors report state-of-the-art results.

Contribution

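The approach learns grounding by reconstructing the input phrase through an attention mechanism over the image. It works with no or limited grounding supervision, can use a direct loss on the attention when supervision is available, and is reported to outperform existing methods on benchmark datasets.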

Findings

01. Effective with no or limited grounding supervision

02. Significant improvement over state-of-the-art methods

03. Works well on multiple datasets with different supervision levels

Abstract

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases, thus it is desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available it can be directly applied via a loss over the attention mechanism. We demonstrate the…
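The attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the phrase has already been encoded into a vector and that each candidate image region has a feature vector of the same size. The dot-product scoring and all names (`attend`, `attended_feature`, `ground`, `attention_loss`) are hypothetical stand-ins, and the recurrent phrase encoder and the reconstruction decoder are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(phrase_vec, region_feats):
    """Attention weights over regions.

    Assumption: a plain dot product scores each region against the phrase;
    this is an illustrative stand-in, not the paper's scoring function.
    """
    scores = [sum(p * r for p, r in zip(phrase_vec, feat)) for feat in region_feats]
    return softmax(scores)


def attended_feature(alpha, region_feats):
    """Attention-weighted sum of region features.

    During training, a decoder (omitted here) would try to reconstruct
    the phrase from this vector.
    """
    dim = len(region_feats[0])
    return [sum(a * feat[d] for a, feat in zip(alpha, region_feats)) for d in range(dim)]


def ground(alpha):
    """Test-time grounding: index of the region with the highest attention."""
    return max(range(len(alpha)), key=alpha.__getitem__)


def attention_loss(alpha, gt_index):
    """Supervised case: cross-entropy between attention and the annotated region."""
    return -math.log(alpha[gt_index])


# Toy example: one 2-d phrase vector and three candidate regions.
phrase_vec = [1.0, 0.0]
region_feats = [[0.1, 0.9], [2.0, 0.2], [0.5, 0.5]]

alpha = attend(phrase_vec, region_feats)
context = attended_feature(alpha, region_feats)
predicted_region = ground(alpha)
```

In the unsupervised setting only the reconstruction objective would shape `alpha`. When ground-truth regions are available, a term like `attention_loss` can be added on top, which matches the abstract's description of applying a loss over the attention mechanism.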

Code & Models

Models

linhuixiao/Awesome-Visual-Grounding (model)
