Transformers and CNNs both Beat Humans on SBIR
Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit

TL;DR
This paper demonstrates that vision transformers significantly outperform CNNs and humans in sketch-based image retrieval, addressing flip invariance issues and proposing improved models for the SBIR task.
Contribution
It introduces modifications for better flip equivariance in SBIR models and shows that vision transformers outperform CNNs and humans on large-scale benchmarks.
Findings
Vision transformers outperform CNNs in SBIR.
Proposed modifications improve flip equivariance, countering the harmful invariance to horizontal flip.
Achieved state-of-the-art recall of 62.25% on Sketchy.
Abstract
Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flip (even after model finetuning) is harming performance. To overcome this limitation, we propose several approaches and evaluate in depth each of them to check their effectiveness. Our main contributions are twofold: We propose and evaluate several intuitive modifications to build SBIR solutions with better flip equivariance. We show that vision transformers are more suited for the SBIR task, and that they outperform CNNs with a large margin. We carried out numerous experiments and…
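The flip-invariance problem described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' code: the two toy embedding functions (embed_pooled, embed_positional), the flip_gap helper, the margin value, and the choice of a mirrored photo as the negative are all assumptions made for illustration. It shows only that a standard triplet margin loss cannot separate a photo from its mirror image when the embedding discards left/right layout.

```python
import math

def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

def embed_pooled(image):
    """Hypothetical flip-invariant embedding: mean of each row (pools over width)."""
    return [sum(row) / len(row) for row in image]

def embed_positional(image):
    """Hypothetical flip-sensitive embedding: mean of each column (keeps layout)."""
    width = len(image[0])
    return [sum(row[c] for row in image) / len(image) for c in range(width)]

def flip_gap(embed, image):
    """Distance between an image's embedding and its mirror's; 0 means invariant."""
    return l2(embed(image), embed(hflip(image)))

# Toy data: the "photo" matches the sketch; the negative is the mirrored photo.
sketch = [[1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0]]
photo = sketch
mirrored_photo = hflip(photo)

loss_pooled = triplet_loss(
    embed_pooled(sketch), embed_pooled(photo), embed_pooled(mirrored_photo))
loss_positional = triplet_loss(
    embed_positional(sketch), embed_positional(photo), embed_positional(mirrored_photo))

print("flip gap (pooled):    ", flip_gap(embed_pooled, sketch))
print("flip gap (positional):", flip_gap(embed_positional, sketch))
print("triplet loss (pooled):    ", loss_pooled)
print("triplet loss (positional):", loss_positional)
```

With the pooled embedding, the mirrored photo lands on the same point as the original, so the loss stays at the margin and the mirrored negative cannot be pushed away. The layout-preserving embedding separates the two and the loss drops to zero.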
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
Methods: FLIP
