Learning to embed semantic similarity for joint image-text retrieval

Noam Malali; Yosi Keller

arXiv:2210.03838·cs.CV·October 11, 2022

TL;DR

This paper introduces a deep learning method for joint image-text semantic embedding using a novel metric learning scheme with multitask learning, center loss, and adaptive margin hinge loss, improving retrieval performance.

Contribution

It proposes a new end-to-end trainable framework with differentiable quantization and adaptive margin loss for better semantic embedding of images and captions.

Findings

01

Compares favorably with contemporary state-of-the-art approaches on the MS-COCO, Flickr30K, and Flickr8K datasets.

02

Semantic similarity is approximated by L2 distances in a Euclidean embedding space.

03

Improved image-text retrieval accuracy.
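
As a rough illustration of what retrieval in such a Euclidean embedding space looks like, the sketch below ranks a few image embeddings by their L2 distance to a caption embedding. The function name `rank_by_l2` and the toy 2-D vectors are assumptions made up for this example; they are not taken from the paper, whose embeddings come from a trained network.

```python
import numpy as np


def rank_by_l2(query, gallery):
    """Return gallery indices sorted from nearest to farthest in L2 distance.

    `query` has shape (d,), `gallery` has shape (n, d).
    Hypothetical helper for illustration, not the paper's code.
    """
    dists = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dists, kind="stable"), dists


# Toy 2-D embeddings, invented for this example only.
caption_embedding = np.array([1.0, 0.0])
image_embeddings = np.array([
    [0.0, 1.0],   # moderately far from the caption
    [0.9, 0.1],   # close to the caption
    [-1.0, 0.0],  # farthest from the caption
])

order, dists = rank_by_l2(caption_embedding, image_embeddings)
print(order.tolist())  # → [1, 0, 2]
```

The same ranking works in the other direction (image query, caption gallery), since both modalities share one embedding space.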

Abstract

We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches.
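
The two loss terms named in the abstract can be sketched in their generic textbook forms. This is a minimal sketch under stated assumptions, not the authors' formulation: the center loss is written as the mean squared L2 distance to a per-class center, the hinge loss is assumed to be triplet-style, and the margin is passed in as a plain argument because the abstract does not say how it is refined during training. The names `center_loss` and `margin_hinge_loss` are hypothetical.

```python
import numpy as np


def center_loss(embeddings, labels, centers):
    """Half the mean squared L2 distance of each embedding to its class center.

    embeddings: (n, d), labels: (n,) integer class ids, centers: (k, d).
    Generic center-loss form; assumed here, not taken from the paper.
    """
    diffs = embeddings - centers[labels]
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))


def margin_hinge_loss(anchor, positive, negative, margin):
    """Triplet-style hinge: mean of max(0, margin + d(a, p) - d(a, n)).

    The triplet form is an assumption. `margin` is a fixed input here; an
    adaptive scheme would update it between training steps.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))


# Toy values, invented for illustration.
emb = np.array([[0.0, 0.0], [2.0, 0.0]])
lab = np.array([0, 1])
ctr = np.array([[0.0, 0.0], [1.0, 0.0]])
print(center_loss(emb, lab, ctr))  # → 0.25

a = np.array([[0.0, 0.0]])
p = np.array([[1.0, 0.0]])
n = np.array([[3.0, 0.0]])
print(margin_hinge_loss(a, p, n, margin=1.0))  # → 0.0 (negative is far enough)
print(margin_hinge_loss(a, p, n, margin=3.0))  # → 1.0 (larger margin is violated)
```

The last two calls show why the margin matters: the same triplet incurs no penalty under a small margin and a positive penalty under a larger one, which is the quantity an adaptive scheme would adjust.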
