Learning to embed semantic similarity for joint image-text retrieval
Noam Malali, Yosi Keller

TL;DR
This paper introduces a deep learning method for joint image-text semantic embedding using a novel metric learning scheme with multitask learning, center loss, and adaptive margin hinge loss, improving retrieval performance.
Contribution
It proposes a new end-to-end trainable framework with differentiable quantization and adaptive margin loss for better semantic embedding of images and captions.
Findings
Compares favorably with contemporary state-of-the-art approaches on the MS-COCO, Flickr30K, and Flickr8K datasets.
Approximates semantic similarity by L2 distance in a Euclidean embedding space.
Improves image-text retrieval accuracy.
Abstract
We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches.
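As a rough illustration of two ingredients named in the abstract, the sketch below computes a generic center loss and a margin-based hinge loss over L2 distances. It is not the authors' formulation: the function names, the 0.5 scaling, the triplet form of the hinge loss, and the toy vectors are all assumptions made for illustration. The abstract does not say how the margin is refined, so the margin is passed in as a plain parameter, and the differentiable quantization step is omitted.

```python
import math

def l2(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def center_loss(embeddings, labels, centers):
    """Generic center loss: half the mean squared L2 distance from each
    embedding to the center of its semantic class (assumed form)."""
    total = sum(l2(e, centers[y]) ** 2 for e, y in zip(embeddings, labels))
    return 0.5 * total / len(embeddings)

def margin_hinge_loss(anchor, positive, negative, margin):
    """Triplet-style hinge loss on L2 distances (assumed form). The margin is
    an input here; the paper's scheme for adapting it is not reproduced."""
    return max(0.0, margin + l2(anchor, positive) - l2(anchor, negative))

# Hypothetical 2-D embeddings: an image, its matching caption, and an unrelated caption.
image = [0.0, 0.0]
caption_match = [0.0, 1.0]
caption_other = [3.0, 4.0]

# The negative is already 4 units farther than the positive, so a margin of 1 is satisfied.
loss_small = margin_hinge_loss(image, caption_match, caption_other, margin=1.0)
# A larger margin of 5 is violated by 1 unit.
loss_large = margin_hinge_loss(image, caption_match, caption_other, margin=5.0)

# The image and its caption share one semantic class whose center lies midway between them.
centers = {0: [0.0, 0.5]}
c_loss = center_loss([image, caption_match], [0, 0], centers)

print(loss_small, loss_large, c_loss)
```

Changing the margin changes which triplets contribute to the loss, which is why a margin that is refined during training, as the abstract describes, can alter which image-caption pairs drive the updates.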
