Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

Fanyi Xiao; Leonid Sigal; Yong Jae Lee

arXiv:1705.01371·cs.CV·May 4, 2017·21 cites

TL;DR

This paper introduces a weakly-supervised method for visually grounding phrases in images. It trains on image-sentence pairs only, without explicit region-to-phrase annotations, and uses the sentences' parse trees to improve localization.

Contribution

It presents a novel structural loss that uses sentence parse trees to improve phrase grounding, and combines it with a standard discriminative loss in an end-to-end model.

Findings

1. Effective on Microsoft COCO dataset
2. Outperforms existing weakly-supervised methods
3. Utilizes parse tree structures for better localization

Abstract

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks. Specifically, the model is trained with images and their associated image-level captions, without any explicit region-to-phrase correspondence annotations. To this end, we introduce an end-to-end model which learns visual groundings of phrases with two types of carefully designed loss functions. In addition to the standard discriminative loss, which enforces that attended image regions and phrases are consistently encoded, we propose a novel structural loss which makes use of the parse tree structures induced by the sentences. In particular, we ensure complementarity among the attention masks that correspond to sibling noun phrases, and compositionality of attention masks among the children…
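The two structural constraints named in the abstract can be illustrated with a small sketch. The abstract is truncated here and gives no formulas, so everything below is an assumption made for illustration: masks are flattened lists of values in [0, 1], complementarity is measured as pairwise overlap between sibling masks, and compositionality is measured as the squared gap between a parent mask and the elementwise maximum of its children. The function names and weights are hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence

Mask = Sequence[float]  # flattened attention mask, values assumed in [0, 1]


def complementarity_penalty(siblings: List[Mask]) -> float:
    """Mean pairwise overlap between sibling masks (0 when they are disjoint)."""
    total, pairs = 0.0, 0
    for i in range(len(siblings)):
        for j in range(i + 1, len(siblings)):
            total += sum(a * b for a, b in zip(siblings[i], siblings[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


def compositionality_penalty(parent: Mask, children: List[Mask]) -> float:
    """Mean squared gap between the parent mask and the union of its children.

    The union is approximated by an elementwise max (an illustrative choice).
    """
    union = [max(values) for values in zip(*children)]
    return sum((p - u) ** 2 for p, u in zip(parent, union)) / len(parent)


def structural_loss(parent: Mask, children: List[Mask],
                    w_sibling: float = 1.0, w_parent: float = 1.0) -> float:
    """Weighted sum of the two penalties; the weights are hypothetical."""
    return (w_sibling * complementarity_penalty(children)
            + w_parent * compositionality_penalty(parent, children))


# Two sibling noun phrases attending to different cells, parent covering both.
disjoint = structural_loss([1, 1, 0, 0], [[1, 0, 0, 0], [0, 1, 0, 0]])
# Same parent, but the first sibling now also covers the second sibling's cell.
overlapping = structural_loss([1, 1, 0, 0], [[1, 1, 0, 0], [0, 1, 0, 0]])
```

Under these assumptions, the disjoint case incurs no penalty, while the overlapping case is penalized only through the sibling term, since the union of the children still matches the parent.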

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques