CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, Yi-Zhe Song

TL;DR
This paper adapts CLIP with prompt learning for zero-shot sketch-based image retrieval, achieving significant improvements in category-level and fine-grained settings by introducing novel regularization and patch shuffling techniques.
Contribution
It presents a novel prompt learning approach tailored for CLIP to enhance zero-shot sketch-based image retrieval, including new methods for fine-grained matching.
Findings
24.8% improvement in category-level ZS-SBIR over prior art
26.9% performance gain in fine-grained ZS-SBIR
Prompt learning with regularization and patch shuffling enhances retrieval accuracy
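To make "retrieval accuracy" concrete, here is a minimal sketch of how a sketch query might be matched against a photo gallery once both are embedded in a shared space. It assumes cosine-similarity ranking, which is how CLIP embeddings are usually compared; this summary does not spell out the paper's exact retrieval and evaluation protocol. The names `rank_gallery` and `top_k_hit` and the toy 2-D embeddings are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_gallery(query_embedding, gallery):
    """Return gallery ids sorted from most to least similar to the query.

    `gallery` maps a photo id to its embedding. In a real system both the
    sketch and the photos would be encoded by the (assumed) shared encoder.
    """
    scores = {pid: cosine_similarity(query_embedding, emb) for pid, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_hit(ranking, target_id, k=1):
    """True if the paired photo appears among the first k retrieved items."""
    return target_id in ranking[:k]


# Toy embeddings (hypothetical values, not taken from the paper).
gallery = {"shoe_a": [1.0, 0.0], "shoe_b": [0.6, 0.8], "chair": [0.0, 1.0]}
sketch_of_shoe_a = [0.9, 0.1]

ranking = rank_gallery(sketch_of_shoe_a, gallery)
hit_at_1 = top_k_hit(ranking, "shoe_a", k=1)
```

In the fine-grained setting a hit means retrieving the exact paired photo (`shoe_a`, not merely another shoe), whereas in the category-level setting any photo of the correct class would count.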
Abstract
In this paper, we leverage CLIP for zero-shot sketch based image retrieval (ZS-SBIR). We are largely inspired by recent advances on foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First we show just via factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts, by a large margin (24.8%) - a great testimony on studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of…
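The patch shuffling mentioned in the TL;DR can be illustrated with a small generic sketch. The summary only names the technique, so the details here are assumptions: the image is cut into a `grid x grid` layout of equal patches, the patches are permuted, and a fixed seed allows the same permutation to be reused on a second image, such as a paired photo. The function name `shuffle_patches` and the list-of-rows image format are hypothetical.

```python
import random


def shuffle_patches(image, grid, seed=None):
    """Cut a 2-D image (list of rows) into grid x grid patches and permute them.

    Returns the shuffled image and the permutation used, where order[i] is the
    index (row-major) of the original patch placed at position i.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by grid")
    ph, pw = height // grid, width // grid

    # Collect patches in row-major order; each patch is a list of row slices.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]

    # A seeded generator makes the permutation reproducible across images.
    order = list(range(grid * grid))
    random.Random(seed).shuffle(order)

    # Reassemble the image from the permuted patches, one pixel row at a time.
    shuffled = []
    for r in range(grid):
        for y in range(ph):
            row = []
            for c in range(grid):
                row.extend(patches[order[r * grid + c]][y])
            shuffled.append(row)
    return shuffled, order


# A 4x4 toy "image" whose pixel values are 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled_a, order_a = shuffle_patches(image, grid=2, seed=0)
shuffled_b, order_b = shuffle_patches(image, grid=2, seed=0)
```

Reusing the seed yields the same permutation, which is one plausible way to keep a sketch and its photo aligned patch by patch after shuffling; whether the paper does exactly this is not stated in the summary.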
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
