Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval
Haiwen Li, Fei Su, Zhicheng Zhao

TL;DR
This paper introduces a lightweight, post-hoc framework for zero-shot composed image retrieval that leverages large language models and a novel adapter to improve modality and task adaptation, achieving state-of-the-art results.
Contribution
It proposes a new triplet construction pipeline and the MoTa-Adapter for effective, parameter-efficient fine-tuning in ZS-CIR tasks, addressing modality and task discrepancies.
Findings
The framework delivers significant performance improvements on four benchmarks.
It achieves state-of-the-art results with inversion-based methods.
Entropy-based optimization lets it handle challenging samples effectively.
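This summary does not spell out how entropy enters the optimization. As a generic illustration only, and not the paper's method, the sketch below computes the Shannon entropy of a softmax over query-to-candidate similarity scores, which is one common way to flag ambiguous (hard) samples. The function names, the `temperature` parameter, and the example scores are all assumptions made for this sketch.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw similarity scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_difficulty(scores: List[float], temperature: float = 1.0) -> float:
    """Assumed difficulty proxy: higher entropy means a less decisive ranking."""
    return entropy(softmax(scores, temperature))


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]  # one candidate clearly stands out
    ambiguous = [1.0, 1.0, 1.0, 1.0]   # all candidates look equally good
    print(sample_difficulty(confident), sample_difficulty(ambiguous))
```

Under this assumption, a query whose candidates all score alike gets the maximum entropy (log of the number of candidates), while a query with one dominant candidate gets an entropy near zero.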
Abstract
As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that effectively represent the input semantics. However, the inversion-based methods suffer from two inherent issues: First, the task discrepancy exists because inversion training and CIR inference involve different objectives. Second, the modality discrepancy arises from the input feature distribution mismatch between training and inference. To this end, we propose a lightweight post-hoc framework, consisting of two components: (1) A new text-anchored triplet construction pipeline leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet.…
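As a rough illustration of the text-anchored triplet idea in the abstract, the sketch below turns image-caption pairs into (reference image, modification text, target text) triplets. Everything here is an assumption made for illustration: the `Triplet` fields, the `build_triplets` helper, and the `rewrite` callable standing in for the LLM are hypothetical names, and the paper's actual prompts and pipeline are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    reference_image: str    # path or ID of the reference image
    modification_text: str  # edit instruction produced by the LLM stand-in
    target_text: str        # textual description acting as the retrieval target


# Hypothetical LLM interface: caption -> (modification text, target description).
Rewriter = Callable[[str], Tuple[str, str]]


def build_triplets(pairs: Iterable[Tuple[str, str]], rewrite: Rewriter) -> List[Triplet]:
    """Convert (image, caption) pairs into text-anchored triplets."""
    triplets = []
    for image, caption in pairs:
        modification, target = rewrite(caption)
        triplets.append(Triplet(image, modification, target))
    return triplets


def toy_rewrite(caption: str) -> Tuple[str, str]:
    """Deterministic stand-in for an LLM call, only here to make the sketch runnable."""
    modification = "make it nighttime"
    return modification, f"{caption}, at night"


if __name__ == "__main__":
    demo = build_triplets([("img_001.jpg", "a dog on a beach")], toy_rewrite)
    print(demo[0])
```

Because the target of each triplet is text rather than a second image, a plain image-text dataset is enough as input; in a real pipeline `toy_rewrite` would be replaced by a call to an actual LLM.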
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
