Learning Deep Structure-Preserving Image-Text Embeddings
Liwei Wang, Yin Li, Svetlana Lazebnik

TL;DR
This paper introduces a two-branch deep neural network for learning joint image-text embeddings that preserve within-view neighborhood structure, significantly improving image-to-text and text-to-image retrieval accuracy and achieving state-of-the-art results on the Flickr30K and MSCOCO datasets.
Contribution
The paper presents a novel two-branch neural network with a combined ranking and neighborhood preservation loss for joint image-text embedding learning.
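As a rough illustration of the two-branch idea, the sketch below maps image features and text features through separate stacks of linear projections and nonlinearities into one shared space. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`init_branch`, `embed`), the layer sizes, the choice of ReLU, and the final L2 normalization are all hypothetical choices made for the example.

```python
import numpy as np

def init_branch(rng, in_dim, hidden_dim, embed_dim):
    """Create weights for one branch: linear -> ReLU -> linear (sizes are illustrative)."""
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) * 0.1,
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, embed_dim)) * 0.1,
        "b2": np.zeros(embed_dim),
    }

def embed(params, features):
    """Project a batch of features into the shared space and L2-normalize each row."""
    hidden = np.maximum(features @ params["W1"] + params["b1"], 0.0)  # ReLU nonlinearity
    z = hidden @ params["W2"] + params["b2"]
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, 1e-12)  # guard against division by zero

rng = np.random.default_rng(0)

# Hypothetical input sizes: 64-d image features, 48-d text features, 16-d shared space.
image_branch = init_branch(rng, in_dim=64, hidden_dim=32, embed_dim=16)
text_branch = init_branch(rng, in_dim=48, hidden_dim=32, embed_dim=16)

image_features = rng.standard_normal((5, 64))   # 5 stand-in images
text_features = rng.standard_normal((10, 48))   # 10 stand-in sentences

image_emb = embed(image_branch, image_features)
text_emb = embed(text_branch, text_features)

# Because both views live in one space, cross-view similarity is a plain dot product.
similarity = image_emb @ text_emb.T  # shape (5, 10)
```

The two branches share no weights; only the output dimensionality is common, which is what allows images and sentences to be compared directly for retrieval.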
Findings
Achieves new state-of-the-art results on the Flickr30K and MSCOCO image-sentence datasets.
Significantly improves image-to-text and text-to-image retrieval accuracy.
Shows promise on the new task of phrase localization on the Flickr30K Entities dataset.
Abstract
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is trained using a large margin objective that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints inspired by metric learning literature. Extensive experiments show that our approach gains significant improvements in accuracy for image-to-text and text-to-image retrieval. Our method achieves new state-of-the-art results on the Flickr30K and MSCOCO image-sentence datasets and shows promise on the new task of phrase localization on the Flickr30K Entities dataset.
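The objective described in the abstract can be sketched as a sum of hinge-based triplet terms: two cross-view ranking terms (image-to-text and text-to-image) plus within-view terms that keep items with the same meaning close together. The code below is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names, the use of Euclidean distance, the margin, the weights `lam_*`, and the rule that two sentences are neighbors when they match the same image are all hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def triplet_hinge(dist, pos_mask, neg_mask, margin):
    """Sum of max(0, margin + dist[i, j] - dist[i, k]) over positives j and negatives k."""
    violation = margin + dist[:, :, None] - dist[:, None, :]   # indexed [i, j, k]
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]
    return float(np.sum(np.maximum(violation, 0.0) * valid))

def embedding_loss(x, y, match, margin=0.1, lam_t2i=1.0, lam_img=0.1, lam_txt=0.1):
    """x: image embeddings (n_img, d); y: text embeddings (n_txt, d);
    match[i, j] is True when sentence j describes image i.
    Margin and weights are placeholder values, not tuned settings."""
    d_xy = pairwise_dist(x, y)

    # Cross-view ranking constraints in both directions.
    img2txt = triplet_hinge(d_xy, match, ~match, margin)
    txt2img = triplet_hinge(d_xy.T, match.T, ~match.T, margin)

    # Within-view neighborhoods: items that share a match in the other view.
    m = match.astype(int)
    same_txt = (m.T @ m) > 0   # sentences describing a common image
    same_img = (m @ m.T) > 0   # images described by a common sentence
    txt_self = np.eye(len(y), dtype=bool)
    img_self = np.eye(len(x), dtype=bool)
    txt_struct = triplet_hinge(pairwise_dist(y, y), same_txt & ~txt_self, ~same_txt, margin)
    img_struct = triplet_hinge(pairwise_dist(x, x), same_img & ~img_self, ~same_img, margin)

    return img2txt + lam_t2i * txt2img + lam_img * img_struct + lam_txt * txt_struct

# Toy data: 2 images, 4 sentences, 2 sentences per image.
match = np.array([[True, True, False, False],
                  [False, False, True, True]])
y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

x_good = np.array([[1.0, 0.0], [0.0, 1.0]])  # each image sits on its own sentences
x_bad = np.array([[0.0, 1.0], [1.0, 0.0]])   # each image sits on the wrong sentences

good_loss = embedding_loss(x_good, y, match)
bad_loss = embedding_loss(x_bad, y, match)
```

In this toy setup the well-aligned embedding satisfies every margin constraint, so its loss is zero, while the swapped embedding violates the cross-view ranking constraints and is penalized. The dense `[i, j, k]` tensor is convenient for small batches only; a practical training loop would sample triplets rather than enumerate them.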
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
