Zero Shot Composed Image Retrieval

Santhosh Kakarla; Gautama Shastry Bulusu Venkata

arXiv:2506.06602·cs.CV·June 10, 2025

Open Access

TL;DR

This paper improves zero-shot composed image retrieval by fine-tuning BLIP-2 with a lightweight Q-Former for multimodal fusion, roughly doubling FashionIQ Recall@10 over the 20-25% zero-shot baseline, and analyzes why an alternative approach, Retrieval-DPO, falls short.

Contribution

It introduces a lightweight fine-tuning method with a Q-Former for improved multimodal fusion in zero-shot CIR, and critically evaluates Retrieval-DPO's limitations for this task.

Findings

01

Fine-tuning BLIP-2 with a Q-Former raises FashionIQ Recall@10 to 45.6% (shirt), 40.1% (dress), and 50.4% (top-tee), about 45% on average.

02

Retrieval-DPO performs poorly due to lack of fusion and unsuitable training objectives.

03

Effective zero-shot CIR requires multimodal fusion, ranking-aware loss, and high-quality negatives.
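
To make "ranking-aware loss" and "negatives" in the findings above concrete, here is a minimal sketch of one common ranking-aware objective: a contrastive (InfoNCE-style) loss with in-batch negatives. This listing does not specify the paper's actual training loss, so the function name `info_nce_loss`, the `temperature` value, and the use of in-batch negatives are all assumptions made for illustration. Each fused query embedding is pushed toward its own target image and away from the other targets in the batch.

```python
import numpy as np

def info_nce_loss(query_emb, target_emb, temperature=0.07):
    """Illustrative contrastive loss (assumed, not taken from the paper).

    query_emb:  (B, D) fused reference-image + text embeddings.
    target_emb: (B, D) target-image embeddings; row i is the positive for
                query i, and every other row serves as an in-batch negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    logits = (q @ t.T) / temperature                     # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the log-probability of the positive.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

The loss is small when each query ranks its own target first and grows when a negative outranks the positive, which is the sense in which it is ranking-aware. Harder negatives (for example, nearest neighbours mined from an index) would replace or supplement the in-batch rows of `target_emb`.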

Abstract

Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit (e.g., "turn the dress blue" or "remove stripes") to a reference image. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding, raising Recall@10 to 45.6% (shirt), 40.1% (dress), and 50.4% (top-tee) and increasing the average Recall@50 to 67.6%. We also examine Retrieval-DPO, which fine-tunes CLIP's text encoder with a Direct Preference Optimization loss applied to FAISS-mined hard negatives. Despite extensive tuning of the scaling factor, index, and sampling strategy, Retrieval-DPO attains only 0.02% Recall@10, far below zero-shot and…
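
The Recall@10 and Recall@50 figures quoted in the abstract measure how often the correct target image appears among the top K retrieved gallery images. The sketch below shows one straightforward way to compute that metric from embeddings. It assumes cosine similarity and exactly one ground-truth target per query. The function name `recall_at_k` and its arguments are hypothetical and are not drawn from the paper's code.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, target_idx, k=10):
    """Fraction of queries whose target image is among the top-k results.

    query_emb:   (Q, D) one composed (image + text) embedding per query.
    gallery_emb: (N, D) embeddings of all candidate images.
    target_idx:  length-Q sequence of gallery indices of the correct targets.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # (Q, N)

    # Indices of the k most similar gallery items per query.
    # Order inside the top-k set does not matter for recall.
    k = min(k, g.shape[0])
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]

    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())
```

For example, with a gallery of five mutually orthogonal embeddings and queries that coincide with gallery items 0, 1, and 2, calling `recall_at_k(queries, gallery, [0, 2, 2], k=1)` counts two hits out of three queries.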

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques