Language-only Efficient Training of Zero-shot Composed Image Retrieval

Geonmo Gu; Sanghyuk Chun; Wonjae Kim; Yoohoon Kang; Sangdoo Yun

arXiv:2312.01998 · cs.CV · April 2, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces LinCIR, a language-only training framework for zero-shot composed image retrieval. It trains with a self-supervised objective on text alone, so it needs no triplet dataset, and it trains in 48 minutes with a CLIP ViT-G backbone.

Contribution

LinCIR is a novel zero-shot CIR method trained solely on text data using self-masking projection, enhancing scalability and generalizability over existing approaches.

Findings

01

Trained LinCIR in 48 minutes with a CLIP ViT-G backbone.

02

Achieved state-of-the-art zero-shot performance on four CIR benchmarks.

03

Outperformed supervised methods on the FashionIQ dataset.

Abstract

The composed image retrieval (CIR) task takes a composed query of an image and text, and aims to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset of triplets (query image, query text, target image), which is very expensive to collect. Several recent works adopt the zero-shot (ZS) CIR paradigm to avoid pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability because the input texts seen during training lack diversity. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained on text datasets alone through a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the…
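
To make the task setup in the abstract concrete, here is a minimal sketch of composed retrieval over toy embeddings. It only illustrates the retrieval step (ranking gallery images by cosine similarity to a composed query embedding). The `compose` function is a hypothetical stand-in that sums the image and text embeddings; it is not LinCIR's self-masking projection, and the 3-d vectors, item names, and function names are invented for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose(image_emb, text_emb):
    # ASSUMPTION: element-wise sum as a placeholder composition.
    # This is NOT the method from the paper; it only yields a single
    # query vector that reflects both the image and the text condition.
    return normalize([i + t for i, t in zip(image_emb, text_emb)])


def rank(query_emb, gallery):
    """Return gallery item names sorted by descending similarity."""
    scores = {name: cosine(query_emb, emb) for name, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embedding axes (hypothetical): [dress-ness, red-ness, blue-ness]
gallery = {
    "red_dress": [1.0, 1.0, 0.0],
    "blue_dress": [1.0, 0.0, 1.0],
    "blue_shirt": [0.0, 0.0, 1.0],
}

query_image = [1.0, 1.0, 0.0]   # reference image: a red dress
query_text = [0.0, -1.0, 1.0]   # modification text: "make it blue"

ranked = rank(compose(query_image, query_text), gallery)
print(ranked)
```

In this toy setup the composed query lands closest to the blue dress, which satisfies both the image condition (a dress) and the text condition (blue). A real system would obtain the embeddings from a vision-language backbone such as CLIP rather than from hand-written vectors.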

Code & Models

Repositories

navervision/lincir (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training