Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Fengyin Lin, Mingkang Li, Da Li, Timothy Hospedales, Yi-Zhe Song, Yonggang Qi

TL;DR
This paper introduces a unified, explainable transformer-based model for zero-shot sketch-based image retrieval that matches sketches to photos across all variants without external semantic knowledge.
Contribution
It proposes a novel cross-modal matching approach using local patch comparisons, enabling all ZS-SBIR variants with one network and providing interpretability.
Findings
Achieves superior performance across all ZS-SBIR settings.
Provides visual explanations via token correspondences.
Enables sketch-to-photo synthesis through patch replacement.
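The last two findings can be illustrated with a small sketch: given per-patch token embeddings for a sketch and a photo, each sketch patch is paired with its most similar photo patch, and the paired photo patches are then arranged in the sketch's layout. This is only an illustration under assumptions. The function names, the use of cosine similarity, and the hard argmax assignment are hypothetical choices made here, not the paper's learned matching module.

```python
import numpy as np


def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


def replace_patches(sketch_tokens, photo_tokens, photo_patches):
    """Assumed illustration: pair each sketch token with its nearest photo token.

    sketch_tokens: (n, d) embeddings, one per sketch patch
    photo_tokens:  (m, d) embeddings, one per photo patch
    photo_patches: (m, h, w) pixel patches aligned with photo_tokens
    Returns the matched photo index per sketch patch and the photo patches
    reordered into the sketch's patch layout.
    """
    sim = cosine_similarity_matrix(sketch_tokens, photo_tokens)
    match = sim.argmax(axis=1)  # hard correspondence: best photo patch per sketch patch
    return match, photo_patches[match]


# Toy data: 2 sketch patches, 3 photo patches, 3-dimensional embeddings.
toy_sketch_tokens = np.eye(3)[[2, 0]]
toy_photo_tokens = np.eye(3)
toy_photo_patches = np.arange(12).reshape(3, 2, 2)
toy_match, toy_synth = replace_patches(toy_sketch_tokens, toy_photo_tokens, toy_photo_patches)
```

The index array `toy_match` plays the role of the visual explanation (which photo patch each sketch patch maps to), while `toy_synth` is the patch-replaced result.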
Abstract
This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), however with two significant differentiators to prior art: (i) we tackle all variants (inter-category, intra-category, and cross-dataset) of ZS-SBIR with just one network ("everything"), and (ii) we would really like to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. With just this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer to produce visual…
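To make the "comparisons of groups of key local patches" idea concrete, here is a minimal sketch that scores a sketch against each gallery photo by comparing two sets of patch embeddings and then ranks the gallery. It substitutes a simple hand-written rule (best cosine match per sketch patch, averaged) for the paper's learned transformer components, whose details are cut off above. All names (`patch_set_score`, `rank_gallery`) and the aggregation rule are assumptions for illustration.

```python
import numpy as np


def patch_set_score(sketch_tokens, photo_tokens, eps=1e-8):
    """Assumed scoring rule: mean over sketch patches of the best cosine match.

    sketch_tokens: (n, d) array, one embedding per sketch patch
    photo_tokens:  (m, d) array, one embedding per photo patch
    """
    s = sketch_tokens / (np.linalg.norm(sketch_tokens, axis=1, keepdims=True) + eps)
    p = photo_tokens / (np.linalg.norm(photo_tokens, axis=1, keepdims=True) + eps)
    sim = s @ p.T  # (n, m) pairwise cosine similarities
    return float(sim.max(axis=1).mean())


def rank_gallery(sketch_tokens, gallery):
    """Return gallery indices sorted from best to worst match, plus raw scores."""
    scores = np.array([patch_set_score(sketch_tokens, g) for g in gallery])
    return np.argsort(-scores), scores


# Toy example: the second photo contains a match for both sketch patches.
toy_query = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_gallery = [
    np.array([[1.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0]]),
]
toy_order, toy_scores = rank_gallery(toy_query, toy_gallery)
```

Because the score depends only on which local patches are present and not on where they sit, it behaves like a bag-of-words comparison; no class labels or external semantic embeddings enter the computation.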
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Concatenated Skip Connection · Softmax
