You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song

TL;DR
This paper introduces a duet framework that combines sketches and text for fine-grained image retrieval. It improves retrieval precision and enables new applications by building on pre-trained CLIP models, without requiring extensive fine-grained textual descriptions.
Contribution
It proposes a compositionality framework that integrates sketches and text for fine-grained retrieval, extending capabilities beyond sketch-only methods using pre-trained CLIP models.
Findings
Enhanced retrieval precision with combined sketch and text inputs
Enabled fine-grained queries incorporating color and context
Extended to applications like image retrieval and attribute transfer
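The first two findings can be illustrated with a toy example. The sketch below is not the paper's compositionality framework: it assumes sketch, text, and gallery embeddings have already been produced by some CLIP-style encoder, and it swaps in a plain weighted sum of normalized embeddings as the fusion step. The names `compose_query`, `rank_gallery`, and `text_weight` are hypothetical, and the three-dimensional vectors are hand-made stand-ins for real embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_query(sketch_emb, text_emb, text_weight=0.5):
    """Fuse a sketch and a text embedding by a weighted sum.

    Assumption: simple late fusion, chosen for illustration only;
    it is not the method proposed in the paper.
    """
    s = l2_normalize(sketch_emb)
    t = l2_normalize(text_emb)
    return l2_normalize((1.0 - text_weight) * s + text_weight * t)

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by cosine similarity, plus the scores."""
    scores = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    return np.argsort(-scores), scores

# Toy embeddings: axis 0 ~ shape cue, axis 1 ~ colour cue, axis 2 ~ other.
sketch_emb = np.array([1.0, 0.0, 0.0])   # the sketch conveys shape only
text_emb = np.array([0.0, 1.0, 0.0])     # the text conveys colour only
gallery = np.array([
    [1.0, 1.0, 0.0],   # matching shape, matching colour
    [1.0, 0.0, 1.0],   # matching shape, different colour
    [0.0, 1.0, 1.0],   # different shape, matching colour
])

# Sketch alone cannot separate the first two gallery items.
_, sketch_only_scores = rank_gallery(sketch_emb, gallery)

# Sketch plus text ranks the item that matches both cues first.
order, duet_scores = rank_gallery(compose_query(sketch_emb, text_emb), gallery)
```

In this toy setup the sketch-only query scores the two shape-matching images identically, while the fused query puts the image that matches both shape and colour on top. The equal weighting is an arbitrary choice; a real system would learn or tune how the two modalities are combined.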
Abstract
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual…
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
