An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
Jaeseok Byun, Seokhyeon Jeong, Wonjae Kim, Sanghyuk Chun, Taesup Moon

TL;DR
This paper introduces RTD, a post-hoc framework that reduces task discrepancy in text encoders for composed image retrieval, significantly improving performance with minimal additional training.
Contribution
The authors propose RTD, a novel text-only contrastive learning method that enhances text encoder performance in CIR, addressing task discrepancy without extensive retraining.
Findings
RTD improves CIR performance to surpass triplet-based methods.
RTD achieves comparable results with only 23 minutes of additional training.
RTD is up to 100 times faster in training than existing approaches.
Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive training CIR triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation in these projection-based CIR: a task discrepancy of text encoders between the original pre-training task of the encoders (text image) and the target CIR task (image + text image), which potentially negatively impacts CIR performance. To reduce such a discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
MethodsContrastive Learning · Contrastive Language-Image Pre-training
