Contrastive Learning of Visual-Semantic Embeddings

Anurag Jain; Yashaswi Verma

arXiv:2110.08872 · cs.CV · October 19, 2021

TL;DR

This paper introduces two novel contrastive loss functions for learning joint visual-semantic embeddings, improving cross-modal image-text retrieval performance on MS-COCO and Flickr30K datasets.

Contribution

It proposes two normalized cross-entropy based contrastive losses tailored for batch training in multi-modal embedding tasks, with a focus on negative sampling strategies.

Findings

01 Outperforms the state of the art on the MS-COCO dataset

02 Achieves comparable results on the Flickr30K dataset

03 Demonstrates the effectiveness of negative sampling strategies in contrastive learning

Abstract

Contrastive learning is a powerful technique to learn representations that are semantically distinctive and geometrically invariant. While most of the earlier approaches have demonstrated its effectiveness on single-modality learning tasks such as image classification, recently there have been a few attempts towards extending this idea to multi-modal data. In this paper, we propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding using batch contrastive training. In a batch, for a given anchor point from one modality, we consider its negatives only from another modality, and define our first contrastive loss based on expected violations incurred by all the negatives. Next, we update this loss and define the second contrastive loss based on the violation incurred only by the hardest negative. We compare our results with…
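
The two losses described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the softmax cross-entropy form, the cosine similarity, the temperature `tau`, and all function names. The first function lets every cross-modal negative in the batch contribute to the loss; the second keeps only the hardest negative for each anchor.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(images, texts):
    # sim[i][j] = cosine similarity between image i and text j.
    images = [l2_normalize(v) for v in images]
    texts = [l2_normalize(v) for v in texts]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]

def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def all_negatives_loss(sim, tau=0.1):
    # Assumed form: softmax cross-entropy in which every item of the
    # other modality (except the matching one) acts as a negative.
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [sim[i][j] / tau for j in range(n)]  # image i vs. all texts
        col = [sim[j][i] / tau for j in range(n)]  # text i vs. all images
        total += _log_sum_exp(row) - row[i]
        total += _log_sum_exp(col) - col[i]
    return total / (2 * n)

def hardest_negative_loss(sim, tau=0.1):
    # Assumed form: same cross-entropy, but only the most similar
    # (hardest) cross-modal negative is kept. Needs a batch of >= 2 pairs.
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i] / tau
        hard_txt = max(sim[i][j] for j in range(n) if j != i) / tau
        hard_img = max(sim[j][i] for j in range(n) if j != i) / tau
        total += _log_sum_exp([pos, hard_txt]) - pos
        total += _log_sum_exp([pos, hard_img]) - pos
    return total / (2 * n)

# Toy batch: image i and text i are a matching pair.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sim = similarity_matrix(images, texts)
loss_all = all_negatives_loss(sim)
loss_hard = hardest_negative_loss(sim)
```

In this sketch the hardest-negative loss can never exceed the all-negatives loss for the same batch, because its denominator keeps a subset of the terms. Whether the paper's losses share that property depends on its actual definitions, which the truncated abstract does not show.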

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques