Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra

TL;DR
This paper introduces a new multimodal retrieval task combining sketches and text to find elusive objects in images, along with a large dataset and a transformer-based baseline model that outperforms existing methods.
Contribution
The paper defines a novel composite sketch+text image retrieval problem, curates a large dataset for it, CSTBIR, and proposes STNET, a pretrained multimodal transformer baseline trained with additional objectives designed for this task.
Findings
STNET outperforms state-of-the-art retrieval methods
CSTBIR dataset contains 2 million queries and 108K images
Proposed training objectives enhance model performance
Abstract
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
