SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval
Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

TL;DR
SCOT introduces a self-supervised contrastive pretraining approach for zero-shot compositional image retrieval, leveraging large image-text datasets and language models to improve generalization without labeled triplets.
Contribution
The paper presents a novel zero-shot pretraining method for compositional retrieval that eliminates the need for labor-intensive triplet datasets by using large language models for supervision.
Findings
Outperforms state-of-the-art zero-shot methods on FashionIQ and CIRR.
Achieves results competitive with fully supervised models.
Demonstrates strong generalization to unseen objects and domains.
Abstract
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we…
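The abstract describes contrastively training an embedding composition network and then retrieving a target from a composed query. A minimal sketch of that general idea follows, in plain Python with toy vectors. Everything concrete here is an assumption for illustration and is not taken from the paper: the `compose` function (a normalized sum standing in for a learned network), the InfoNCE-style loss in `info_nce`, the temperature value, and the toy embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compose(image_emb, text_emb):
    """Hypothetical stand-in for a learned composition network:
    a normalized sum of the image and modification-text embeddings."""
    return l2_normalize([i + t for i, t in zip(image_emb, text_emb)])

def info_nce(queries, targets, temperature=0.07):
    """InfoNCE-style contrastive loss (an assumed objective):
    queries[i] should score highest against targets[i] within the batch."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def retrieve(query, gallery):
    """Return the index of the gallery embedding most similar to the query."""
    return max(range(len(gallery)), key=lambda j: dot(query, gallery[j]))

# Toy data: two reference images, two modification texts, two targets.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
edit_embs = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
targets = [l2_normalize([1.0, 0.0, 1.0]), l2_normalize([0.0, 1.0, -1.0])]

queries = [compose(i, t) for i, t in zip(image_embs, edit_embs)]
aligned_loss = info_nce(queries, targets)
shuffled_loss = info_nce(queries, targets[::-1])
```

In this toy setup the loss is lower when each composed query is paired with its own target than when the targets are shuffled, which is the signal a contrastive objective uses to pull matching pairs together.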
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
