SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Bhavin Jawade; Joao V. B. Soares; Kapil Thadani; Deen Dayal Mohan; Amir Erfan Eshratifar; Benjamin Culpepper; Paloma de Juan; Srirangaraj Setlur; Venu Govindaraju

arXiv:2501.08347·cs.CV·January 16, 2025

Open Access

TL;DR

SCOT introduces a self-supervised contrastive pretraining approach for zero-shot compositional image retrieval, leveraging large image-text datasets and language models to improve generalization without labeled triplets.

Contribution

The paper presents a novel zero-shot pretraining method for compositional retrieval that eliminates the need for labor-intensive triplet datasets by using large language models for supervision.

Findings

1. Outperforms state-of-the-art zero-shot methods on FashionIQ and CIRR.
2. Achieves competitive results compared to fully-supervised models.
3. Demonstrates strong generalization to unseen objects and domains.

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we…
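
The contrastive training idea described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss, which is one common contrastive objective; the abstract does not state the exact loss or architecture. The `compose` function (a single linear map over concatenated embeddings), the `temperature` value, and the random embeddings standing in for image and text encoder outputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def compose(image_emb, text_emb, weight):
    """Hypothetical composition network: one linear map over [image; text]."""
    joint = np.concatenate([image_emb, text_emb], axis=-1)
    return l2_normalize(joint @ weight)


def info_nce_loss(composed, targets, temperature=0.07):
    """InfoNCE-style loss: row i of `composed` should match row i of `targets`.

    All other rows in the batch serve as negatives.
    """
    logits = composed @ targets.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def retrieve(composed_query, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = gallery @ composed_query
    return np.argsort(-scores)[:k]


# Random stand-ins for encoder outputs (assumption: a real system would use
# pretrained image and text encoders here).
rng = np.random.default_rng(0)
batch, dim = 4, 8
image_emb = l2_normalize(rng.normal(size=(batch, dim)))
text_emb = l2_normalize(rng.normal(size=(batch, dim)))
target_emb = l2_normalize(rng.normal(size=(batch, dim)))
weight = rng.normal(size=(2 * dim, dim)) * 0.1

composed = compose(image_emb, text_emb, weight)
loss = info_nce_loss(composed, target_emb)
```

In a real training loop, the gradient of `loss` with respect to the composition parameters would pull each composed query embedding toward its target embedding and away from the other targets in the batch. At inference time, `retrieve` ranks a gallery of candidate embeddings against a composed query.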

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning