Language-only Efficient Training of Zero-shot Composed Image Retrieval
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun

TL;DR
This paper introduces LinCIR, a language-only training framework for zero-shot composed image retrieval that uses self-supervision to achieve high performance without requiring triplet datasets, significantly reducing training time.
Contribution
LinCIR is a novel zero-shot CIR method trained solely on text data using self-masking projection, enhancing scalability and generalizability over existing approaches.
Findings
Trained LinCIR in 48 minutes with CLIP ViT-G backbone.
Achieved state-of-the-art zero-shot performance on four CIR benchmarks.
Outperformed supervised methods on FashionIQ dataset.
Abstract
The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have explored the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained with text datasets alone through a novel self-supervision scheme named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the…
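To make the retrieval setup in the abstract concrete, here is a minimal sketch of ranking a gallery against a composed (image + text) query by cosine similarity. It is not LinCIR: the function names are hypothetical, the embeddings are toy vectors rather than CLIP features, and the composition step (a plain sum of normalized embeddings) is an assumed stand-in for whatever a real method does.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def compose_query(image_emb, text_emb):
    # ASSUMPTION: a plain sum of normalized embeddings, used only as a
    # placeholder for a real composition method. This is not LinCIR.
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity,
    together with the similarity scores."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    scores = g @ q  # cosine similarity, since both sides are unit length
    return np.argsort(-scores), scores

# Toy 3-dimensional embeddings (hypothetical values, not real features).
gallery = np.array([
    [1.0, 0.0, 0.0],  # matches the image condition only
    [0.0, 1.0, 0.0],  # matches the text condition only
    [1.0, 1.0, 0.0],  # matches both conditions
    [0.0, 0.0, 1.0],  # matches neither
])
query = compose_query(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
order, scores = rank_gallery(query, gallery)
print(order, scores)
```

In this toy setup the gallery item that satisfies both conditions scores highest and the item that satisfies neither scores lowest, which is the behavior the CIR task asks for.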
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
