Contrastive Learning for Weakly Supervised Phrase Grounding

Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem

arXiv:2006.09920 · cs.CV · August 7, 2020


TL;DR

This paper introduces a contrastive learning approach for weakly supervised phrase grounding, leveraging mutual information maximization and language model-guided negatives to improve accuracy in associating image regions with caption words.

Contribution

The paper proposes a contrastive learning framework for weakly supervised phrase grounding in which negative captions are constructed through language-model-guided word substitutions rather than sampled at random.
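The word-substitution idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `negative_captions`, `toy_proposer`, and the hard-coded `TOY_SUBSTITUTES` table are hypothetical, and the table only stands in for a language model that would propose context-plausible replacements. It is not the authors' procedure.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model. A real system would ask a
# language model for replacements that are plausible in context; this table
# is hard-coded purely so the sketch runs on its own.
TOY_SUBSTITUTES: Dict[str, List[str]] = {
    "dog": ["cat", "horse"],
    "frisbee": ["ball", "stick"],
}

Proposer = Callable[[List[str], int], List[str]]


def toy_proposer(tokens: List[str], index: int) -> List[str]:
    """Return candidate replacements for tokens[index] (assumed interface)."""
    return TOY_SUBSTITUTES.get(tokens[index], [])


def negative_captions(
    caption: str, index: int, propose: Proposer, max_negatives: int = 5
) -> List[str]:
    """Build negative captions by swapping the word at `index`."""
    tokens = caption.split()
    original = tokens[index]
    negatives: List[str] = []
    for candidate in propose(tokens, index):
        if candidate == original:
            continue  # a negative must differ from the true caption
        swapped = tokens[:index] + [candidate] + tokens[index + 1:]
        negatives.append(" ".join(swapped))
        if len(negatives) == max_negatives:
            break
    return negatives


if __name__ == "__main__":
    for neg in negative_captions("a dog catches a frisbee", 1, toy_proposer):
        print(neg)
```

Each negative differs from the true caption in a single word, so the model has to use the image to tell them apart. That is the intuition behind preferring such negatives over randomly sampled captions.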

Findings

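01. Training with language-model-guided negative captions yields roughly a 10% absolute accuracy gain over negatives sampled randomly from the training data.

02. A weakly supervised model trained on COCO-Captions reaches 76.7% accuracy on the Flickr30K Entities benchmark, a gain of 5.7%.

03. Phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words.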

Abstract

Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a $\sim 10\%$ absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
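The objective described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, an InfoNCE-style contrastive loss as the mutual-information lower bound (the abstract does not name the bound), and negatives formed from non-corresponding images only. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def log_sum_exp(x, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def attended_regions(words, regions):
    """Attend from each word to the image regions.

    words:   (T, d) word features
    regions: (R, d) region features
    Returns the (T, d) attention-weighted region features and the (T, R)
    attention map. At test time the arg-max over regions would be the
    grounding prediction for each word.
    """
    scores = words @ regions.T / np.sqrt(words.shape[1])
    attn = softmax(scores, axis=1)
    return attn @ regions, attn


def compatibility(words, regions):
    """Per-word compatibility with an image: dot product of each word with
    its attention-weighted region feature. Shape (T,)."""
    attended, _ = attended_regions(words, regions)
    return (words * attended).sum(axis=1)


def infonce_image_loss(words, regions_pos, regions_negs):
    """InfoNCE-style loss contrasting the true image against K others.

    Minimizing this loss maximizes log(K + 1) - loss, which is the standard
    InfoNCE lower bound on mutual information.
    """
    pos = compatibility(words, regions_pos)                       # (T,)
    negs = np.stack(
        [compatibility(words, r) for r in regions_negs], axis=1
    )                                                             # (T, K)
    logits = np.concatenate([pos[:, None], negs], axis=1)         # (T, K+1)
    log_prob_pos = logits[:, 0] - log_sum_exp(logits, axis=1)     # (T,)
    return -log_prob_pos.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = rng.normal(size=(4, 8))        # 4 caption words, 8-dim features
    image = rng.normal(size=(5, 8))        # 5 regions in the paired image
    others = [rng.normal(size=(5, 8)) for _ in range(3)]
    print(infonce_image_loss(words, image, others))
```

The same loss shape applies when the negatives are captions rather than images: the image is held fixed and the compatibility of the true caption's words is contrasted with that of words from substituted captions.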


Code & Models

Repositories

BigRedT/info-ground (PyTorch, official)

Models

🤗 linhuixiao/Awesome-Visual-Grounding (model)
