Contrastive Learning of Visual-Semantic Embeddings
Anurag Jain, Yashaswi Verma

TL;DR
This paper introduces two novel contrastive loss functions for learning joint visual-semantic embeddings, improving cross-modal image-text retrieval performance on MS-COCO and Flickr30K datasets.
Contribution
It proposes two normalized cross-entropy based contrastive losses tailored for batch training in multi-modal embedding tasks, with a focus on negative sampling strategies.
Findings
Outperforms state-of-the-art on MS-COCO dataset
Achieves comparable results on Flickr30K dataset
Demonstrates effectiveness of negative sampling strategies in contrastive learning
Abstract
Contrastive learning is a powerful technique to learn representations that are semantically distinctive and geometrically invariant. While most of the earlier approaches have demonstrated its effectiveness on single-modality learning tasks such as image classification, recently there have been a few attempts towards extending this idea to multi-modal data. In this paper, we propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding using batch contrastive training. In a batch, for a given anchor point from one modality, we consider its negatives only from another modality, and define our first contrastive loss based on expected violations incurred by all the negatives. Next, we update this loss and define the second contrastive loss based on the violation incurred only by the hardest negative. We compare our results with…
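The two losses described in the abstract can be illustrated with a minimal sketch. It is not the authors' exact formulation: it assumes that "normalized cross-entropy" means a softmax cross-entropy over cosine similarities, and the function names and the `temperature` parameter are hypothetical choices made for this example. For each anchor, the negatives come only from the other modality. The first loss sums over all in-batch negatives, and the second keeps only the hardest one.

```python
import math


def cosine_similarity_matrix(images, texts):
    """Return S where S[i][j] is the cosine similarity of image i and text j."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    imgs = [normalize(v) for v in images]
    txts = [normalize(v) for v in texts]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def all_negatives_loss(sim, temperature=0.1):
    """Cross-entropy over the positive and ALL cross-modal negatives.

    Assumption: the matching pair for item i sits on the diagonal, sim[i][i].
    The loss is averaged over both retrieval directions.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i] / temperature
        img_to_txt = [sim[i][j] / temperature for j in range(n)]  # image anchor
        txt_to_img = [sim[j][i] / temperature for j in range(n)]  # text anchor
        total += (_log_sum_exp(img_to_txt) - pos) + (_log_sum_exp(txt_to_img) - pos)
    return total / (2 * n)


def hardest_negative_loss(sim, temperature=0.1):
    """Same cross-entropy, but keeping only the HARDEST negative per anchor.

    Requires a batch of at least two pairs so that a negative exists.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i] / temperature
        hard_txt = max(sim[i][j] for j in range(n) if j != i) / temperature
        hard_img = max(sim[j][i] for j in range(n) if j != i) / temperature
        total += (_log_sum_exp([pos, hard_txt]) - pos) + (_log_sum_exp([pos, hard_img]) - pos)
    return total / (2 * n)
```

In this sketch the hardest-negative loss never exceeds the all-negatives loss on the same batch, because it drops terms from the softmax denominator. With a batch of two pairs the two losses coincide, since each anchor has exactly one negative.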
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
