TL;DR
This paper localizes (grounds) free-form textual phrases in images by learning to reconstruct each phrase through an attention mechanism over image regions. The approach can be trained with no or little grounding supervision and is reported to improve on prior state-of-the-art methods.
Contribution
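The approach learns grounding by reconstructing the input phrase from an attended image region. The attention can stay latent when no grounding annotations exist, or be supervised directly with a loss when they do, and the method is reported to outperform existing methods on benchmark datasets.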
Findings
Effective with no or limited grounding supervision
Significant improvement over state-of-the-art methods
Works well on multiple datasets with different supervision levels
Abstract
Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground-truth spatial localization of phrases; thus, it is desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be directly applied via a loss over the attention mechanism. We demonstrate the…
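The attention step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bilinear scoring matrix `W`, the function names, and the toy feature sizes are all assumptions made for illustration, and the recurrent phrase encoder and the reconstruction decoder are left out (the phrase is assumed to be already encoded as a vector).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(phrase_vec, region_feats, W):
    # Hypothetical bilinear score between the encoded phrase and each region.
    # phrase_vec: (d_p,), region_feats: (n_regions, d_r), W: (d_p, d_r)
    scores = region_feats @ (W.T @ phrase_vec)
    return softmax(scores)  # attention weights over regions, summing to 1

def attended_feature(alpha, region_feats):
    # Weighted sum of region features; a decoder (omitted here) would
    # try to reconstruct the input phrase from this vector.
    return alpha @ region_feats

def attention_loss(alpha, gt_region):
    # Optional supervised term: cross-entropy on the attention weights,
    # usable only when the ground-truth region index is known.
    return -np.log(alpha[gt_region] + 1e-12)

def ground(alpha):
    # At test time, the predicted grounding is the most-attended region.
    return int(np.argmax(alpha))

# Toy example: 3 candidate regions with 2-D features (assumed values).
phrase_vec = np.array([1.0, 0.0])
region_feats = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
W = np.eye(2)

alpha = attend(phrase_vec, region_feats, W)
context = attended_feature(alpha, region_feats)
predicted_region = ground(alpha)
```

In the unsupervised setting only a reconstruction loss on the decoded phrase would drive `alpha`; where annotations exist, a term such as `attention_loss` could be added on top.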
