Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

Mainak Singha; Ankit Jha; Divyam Gupta; Pranav Singla; Biplab Banerjee

arXiv:2407.04207·cs.CV·July 24, 2024

TL;DR

This paper introduces SpLIP, a multi-modal prompt learning approach leveraging CLIP's visual and textual capabilities to improve zero-shot sketch-based image retrieval across various challenging settings.

Contribution

SpLIP is the first multi-modal prompt learning scheme with bi-directional prompt sharing for SBIR, enhancing cross-modal alignment and generalization.

Findings

01. SpLIP outperforms existing methods on multiple SBIR benchmarks.

02. The bi-directional prompt sharing improves semantic alignment between sketches and photos.

03. Adaptive margin and conditional cross-modal jigsaw strategies further boost retrieval accuracy.
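
To make the "adaptive margin" idea concrete, here is a minimal sketch of a triplet loss whose margin varies per triplet. This page does not say how SpLIP computes its margin, so the rule below (margin grows as the class-name embeddings of the anchor class and the negative class become less similar) is an assumption chosen for illustration. The names `adaptive_margin`, `triplet_loss`, and `base_margin` are hypothetical, and the vectors stand in for embeddings that a real system would obtain from CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_margin(class_emb_anchor, class_emb_negative, base_margin=0.2):
    # ASSUMPTION (not from the paper page): the margin scales with the
    # semantic dissimilarity of the two class-name embeddings, so very
    # different classes are pushed further apart than similar ones.
    return base_margin * (1.0 - cosine(class_emb_anchor, class_emb_negative))


def triplet_loss(sketch, pos_photo, neg_photo, margin):
    # Cosine distance from the sketch (anchor) to a same-class photo
    # and to a different-class photo.
    d_pos = 1.0 - cosine(sketch, pos_photo)
    d_neg = 1.0 - cosine(sketch, neg_photo)
    # Hinge: zero loss once the negative is at least `margin` further away.
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings, purely illustrative.
margin = adaptive_margin([1.0, 0.0], [0.0, 1.0])
loss_good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin)
loss_bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], margin)
```

With a fixed margin every negative class is treated alike. A per-triplet margin lets semantically close classes sit nearer to each other in the embedding space, which is one plausible reason such a scheme could help retrieval.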

Abstract

We address the challenges inherent in sketch-based image retrieval (SBIR) across various settings, including zero-shot SBIR, generalized zero-shot SBIR, and fine-grained zero-shot SBIR, by leveraging the vision-language foundation model CLIP. While recent work has employed CLIP to enhance SBIR, these approaches predominantly follow uni-modal prompt processing and fail to fully exploit CLIP's integrated visual and textual capabilities. To bridge this gap, we introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate effectively with frozen CLIP backbones. We diverge from existing multi-modal prompting methods, which treat visual and textual prompts independently or integrate them only in a limited fashion, leading to suboptimal generalization. SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP's visual and…
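
For readers new to the task, the retrieval step in SBIR can be sketched generically: embed the query sketch and every gallery photo in a shared space, then rank photos by cosine similarity to the sketch. The snippet below is a minimal illustration of that ranking only. It is not SpLIP's pipeline, the function name `rank_gallery` is hypothetical, and the hand-written vectors stand in for embeddings that a model such as CLIP would produce.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_gallery(sketch_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar to the sketch."""
    query = l2_normalize(sketch_emb)
    scored = []
    for idx, photo_emb in enumerate(gallery_embs):
        photo = l2_normalize(photo_emb)
        similarity = sum(a * b for a, b in zip(query, photo))
        scored.append((similarity, idx))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy gallery of three photo embeddings (illustrative values only).
gallery = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]]
ranking = rank_gallery([1.0, 0.0], gallery)
```

In the zero-shot settings the abstract lists, the query and gallery classes are unseen during training, so the quality of this ranking depends entirely on how well the learned embedding space generalizes.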

Code & Models

Repositories

mainaksingha01/splip (Official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training