An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

Jaeseok Byun; Seokhyeon Jeong; Wonjae Kim; Sanghyuk Chun; Taesup Moon

arXiv:2406.09188 · cs.CV · March 19, 2025

TL;DR

This paper introduces RTD, a post-hoc framework that reduces task discrepancy in text encoders for composed image retrieval, significantly improving performance with minimal additional training.

Contribution

The authors propose RTD, a novel text-only contrastive learning method that enhances text encoder performance in CIR, addressing task discrepancy without extensive retraining.

Findings

01. RTD improves CIR performance to surpass triplet-based methods.

02. RTD achieves comparable results with only 23 minutes of additional training.

03. RTD is up to 100 times faster in training than existing approaches.

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation of these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image), which may degrade CIR performance. To reduce such a discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only…
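
To make the projection-based pipeline described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical linear projection (`project_to_token`), a stand-in text encoder that simply mean-pools token embeddings (`toy_text_encoder`), and hand-built 8-dimensional embeddings; a real system would use a pre-trained vision-language model for all of these. It illustrates only how a reference image is mapped to a pseudo token, combined with the conditioning text into a composed query, and matched against gallery image embeddings.

```python
import math

DIM = 8  # toy embedding width (assumption, chosen for readability)

def unit(i):
    """One-hot vector used as a toy embedding."""
    return [1.0 if j == i else 0.0 for j in range(DIM)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def project_to_token(image_emb, weight):
    """Hypothetical linear projection: image embedding -> pseudo token embedding."""
    return [sum(w * x for w, x in zip(row, image_emb)) for row in weight]

def toy_text_encoder(token_embs):
    """Stand-in text encoder: mean-pool the token embeddings, then normalize."""
    n = len(token_embs)
    pooled = [sum(tok[i] for tok in token_embs) / n for i in range(DIM)]
    return l2_normalize(pooled)

def compose_query(ref_image_emb, text_token_embs, weight):
    """Place the projected image token next to the text tokens and encode."""
    pseudo_token = project_to_token(ref_image_emb, weight)
    return toy_text_encoder([pseudo_token] + text_token_embs)

def retrieve(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)

# Toy data: the reference image lies along axis 0, the conditioning text along axis 1.
identity = [unit(i) for i in range(DIM)]           # projection weights (assumption)
ref_image = unit(0)
text_tokens = [unit(1)]
gallery = [
    unit(0),                                        # same as the reference image
    [a + b for a, b in zip(unit(0), unit(1))],      # reference image + modification
    unit(2),                                        # unrelated image
]

query = compose_query(ref_image, text_tokens, identity)
ranking = retrieve(query, gallery)
print(ranking)  # [1, 0, 2]
```

In this toy setup the gallery item that combines the reference content with the text modification ranks first. The task discrepancy the abstract points to arises because the text encoder here receives an image-derived token, an input it was not pre-trained on in the text $\leftrightarrow$ image setting.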

Code & Models

Repositories

navervision/lincir (PyTorch)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Contrastive Learning · Contrastive Language-Image Pre-training