Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch
Sounak Dey, Anjan Dutta, Suman K. Ghosh, Ernest Valveny, Josep Lladós, and Umapada Pal

TL;DR
This paper presents a cross-modal deep learning framework for image retrieval that accepts either text or sketches as queries. An attention mechanism lets it handle queries with multiple objects, and the authors report the best results on standard datasets for both single- and multi-object retrieval.
Contribution
It introduces a joint embedding model for text, sketches, and images, with an attention mechanism for multi-object retrieval, advancing cross-modal retrieval capabilities.
Findings
Reports the best performance among the compared methods for both single- and multi-object image retrieval on standard datasets.
Effectively handles multi-object image retrieval.
Demonstrates the benefit of attention in cross-modal retrieval.
Abstract
In this work we introduce a cross-modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus the attention on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs the best in both single and multiple object image retrieval on standard datasets.
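The retrieval idea in the abstract (a query and an image compared in a common embedding space, with attention focusing on the relevant image regions) can be illustrated with a toy sketch. This is not the authors' architecture: the vectors below are hand-made stand-ins for learned embeddings, the dot-product softmax attention and cosine scoring are assumptions chosen for simplicity, and the names `attend` and `rank_images` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def attend(query, regions):
    # Assumed attention: weight each region vector by the softmax of its
    # dot product with the query, then return the weighted sum.
    weights = softmax([dot(query, r) for r in regions])
    return [sum(w * r[i] for w, r in zip(weights, regions))
            for i in range(len(query))]

def rank_images(query, gallery):
    # gallery maps an image name to its list of region vectors.
    scores = {name: cosine(query, attend(query, regions))
              for name, regions in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: the query stands in for an embedded text or sketch query.
query = [1.0, 0.0, 0.0]
gallery = {
    "img_a": [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],
    "img_b": [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
}
ranking = rank_images(query, gallery)
print(ranking)
```

In this toy setup, the image with a region aligned to the query ranks first, because attention shifts weight toward that region before the similarity is computed. A multi-object query could be handled under the same assumptions by scoring each object's embedding separately and combining the scores.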
