Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
Fanyi Xiao, Leonid Sigal, Yong Jae Lee

TL;DR
This paper introduces a weakly-supervised method that visually grounds (localizes) phrases in images using only image-sentence pairs, leveraging the sentences' parse trees to improve localization without explicit region-to-phrase annotations.
Contribution
It presents a novel structural loss that uses sentence parse trees to improve phrase grounding, and combines it with a standard discriminative loss in an end-to-end model.
Findings
Effective on the Microsoft COCO dataset
Outperforms existing weakly-supervised methods
Utilizes parse tree structures for better localization
Abstract
We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks. Specifically, the model is trained with images and their associated image-level captions, without any explicit region-to-phrase correspondence annotations. To this end, we introduce an end-to-end model which learns visual groundings of phrases with two types of carefully designed loss functions. In addition to the standard discriminative loss, which enforces that attended image regions and phrases are consistently encoded, we propose a novel structural loss which makes use of the parse tree structures induced by the sentences. In particular, we ensure complementarity among the attention masks that correspond to sibling noun phrases, and compositionality of attention masks among the children…
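The two structural constraints named in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the penalty forms below (pairwise overlap for complementarity, squared distance to the element-wise maximum of the children for compositionality), the function names, and the toy masks are all assumptions chosen for illustration, and the pairing of children with their parent phrase is likewise assumed, since the abstract is truncated at that point. Attention masks are represented as flat lists of values in [0, 1].

```python
from itertools import combinations


def complementarity_penalty(sibling_masks):
    """Hypothetical penalty: sum of pairwise overlaps between sibling masks.

    It is zero when no two sibling phrases attend to the same location.
    """
    total = 0.0
    for a, b in combinations(sibling_masks, 2):
        total += sum(x * y for x, y in zip(a, b))
    return total


def compositionality_penalty(parent_mask, child_masks):
    """Hypothetical penalty: squared distance between the parent mask and
    the element-wise maximum (a soft union) of its children's masks.
    """
    union = [max(values) for values in zip(*child_masks)]
    return sum((p - u) ** 2 for p, u in zip(parent_mask, union))


# Toy 2x2 attention masks, flattened to length-4 lists (illustrative values).
man = [1.0, 0.0, 0.0, 0.0]
dog = [0.0, 1.0, 0.0, 0.0]
man_and_dog = [1.0, 1.0, 0.0, 0.0]

sibling_term = complementarity_penalty([man, dog])
parent_term = compositionality_penalty(man_and_dog, [man, dog])
print(sibling_term, parent_term)
```

In this toy case both penalties are zero, because the sibling masks do not overlap and the parent mask equals the union of its children. In a trainable model, terms of this kind would be computed on differentiable tensors and added to the discriminative loss with some weighting, which is also an assumption here.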
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
