TL;DR
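This paper introduces a method for learning multimodal image and text embeddings from Web and Social Media data. It enables semantic image retrieval without supervision, performs competitively with supervised methods on text-based image retrieval, and outperforms the state of the art on the MIRFlickr dataset when trained on the target data.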
Contribution
It presents an approach to learning from Web and Social Media data for semantic image retrieval, an analysis of five text embeddings on three benchmarks, and a new dataset for benchmarking.
Findings
Embeddings learned from Web and Social Media data are competitive with supervised methods in text-based image retrieval.
When trained on the target data, the approach outperforms the state of the art on the MIRFlickr dataset.
Semantic multimodal retrieval extends beyond classical instance-level retrieval.
Abstract
In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval. We demonstrate that the pipeline can learn from images with associated text without supervision and perform a thorough analysis of five different text embeddings on three different benchmarks. We show that the embeddings learnt with Web and Social Media data are competitive with supervised methods in the text-based image retrieval task, and that we clearly outperform the state of the art on the MIRFlickr dataset when training on the target data. Further, we demonstrate how semantic multimodal image retrieval can be performed using the learnt embeddings, going beyond classical instance-level retrieval problems. Finally, we present a new dataset,…
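To make the text-based retrieval idea in the abstract concrete, here is a minimal sketch of ranking images against a text query in a shared embedding space. It is an illustration only, not the paper's pipeline: the function names `cosine` and `retrieve`, the use of cosine similarity as the ranking score, and the toy 3-dimensional vectors are all assumptions made for this example. In practice the image vectors would come from a trained visual model and the query vector from a text embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, image_vecs, k=2):
    """Return the names of the k images most similar to the query vector."""
    ranked = sorted(
        image_vecs,
        key=lambda name: cosine(query_vec, image_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented for illustration only.
image_vecs = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.0, 0.2, 0.9],
    "harbor.jpg": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a text query

top = retrieve(query, image_vecs, k=2)
print(top)
```

Because text and images share one space in this sketch, the same `retrieve` call works for any query vector, including one built by combining several word embeddings.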
