CLAIR: CLIP-Aided Weakly Supervised Zero-Shot Cross-Domain Image Retrieval
Chor Boon Tan, Conghui Hu, Gim Hee Lee

TL;DR
CLAIR leverages CLIP-generated pseudo-labels and contrastive learning to improve weakly supervised zero-shot cross-domain image retrieval, effectively handling noisy labels and domain discrepancies.
Contribution
This paper introduces CLAIR, a novel framework that refines pseudo-labels with confidence scores, employs contrastive losses, and uses a cross-domain mapping with learnable prompts to enhance zero-shot image retrieval.
Findings
CLAIR outperforms existing methods on multiple zero-shot datasets.
The confidence-based pseudo-label refinement improves retrieval accuracy.
Cross-domain mapping with CLIP embeddings effectively reduces domain gaps.
Abstract
The recent growth of large foundation models that can easily generate pseudo-labels for huge quantities of unlabeled data makes unsupervised Zero-Shot Cross-Domain Image Retrieval (UZS-CDIR) less relevant. In this paper, we therefore turn our attention to weakly supervised ZS-CDIR (WSZS-CDIR) with noisy pseudo-labels generated by large foundation models such as CLIP. To this end, we propose CLAIR to refine the noisy pseudo-labels with a confidence score from the similarity between the CLIP text and image features. Furthermore, we design inter-instance and inter-cluster contrastive losses to encode images into a class-aware latent space, and an inter-domain contrastive loss to alleviate domain discrepancies. We also learn a novel cross-domain mapping function in closed-form, using only CLIP text embeddings to project image features from one domain to another, thereby further aligning the…
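The abstract describes scoring pseudo-labels by the similarity between CLIP text and image features. A minimal sketch of that general idea follows; it is not the authors' implementation. It assumes precomputed feature arrays (random stand-ins here instead of real CLIP outputs), and the function name `pseudo_labels_with_confidence`, the temperature of 0.01, and the 0.5 confidence threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def pseudo_labels_with_confidence(image_feats, text_feats, temperature=0.01):
    """Assign each image the class whose text embedding is most similar.

    image_feats: (num_images, dim) array, assumed to come from an image encoder.
    text_feats:  (num_classes, dim) array, assumed to come from class-name prompts.
    Returns (labels, confidence), where confidence is the softmax probability
    of the chosen class. The temperature value is an assumption.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature           # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)

# Toy stand-ins for encoder outputs: 3 classes, 4 images, 8-dimensional features.
rng = np.random.default_rng(0)
text_feats = np.eye(3, 8)
image_feats = text_feats[[0, 2, 1, 0]] + 0.05 * rng.normal(size=(4, 8))

labels, confidence = pseudo_labels_with_confidence(image_feats, text_feats)
keep = confidence >= 0.5  # hypothetical threshold for trusting a pseudo-label
```

In this sketch the confidence is used as a hard filter; it could equally serve as a per-sample weight in a training loss, and the abstract does not say which of these, if either, CLAIR uses.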
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
