MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval
Jeong-Woo Park, Seong-Whan Lee

TL;DR
MCoT-RE introduces a training-free, multi-faceted chain-of-thought and re-ranking framework for zero-shot composed image retrieval, effectively balancing visual context and textual modifications to improve retrieval accuracy.
Contribution
It presents a two-stage, training-free approach that enhances zero-shot CIR by generating dual captions and applying multi-grained re-ranking, outperforming existing training-free methods.
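The two-stage idea can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `two_stage_retrieve`, the argument names, the toy embeddings, the `top_k` cutoff, and the weighted-sum scoring rule with its `weights` are all assumptions made for illustration. The sketch assumes one caption embedding focused on the modification, one focused on the reference image's context, and a reference-image embedding, all living in the same space as the gallery embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stage_retrieve(gallery, mod_caption_vec, ctx_caption_vec, ref_image_vec,
                       top_k=2, weights=(0.5, 0.3, 0.2)):
    """Hypothetical retrieve-then-re-rank loop (illustrative only).

    gallery: dict mapping image id -> embedding (list of floats).
    Stage 1 filters by similarity to the modification-focused caption.
    Stage 2 re-ranks the survivors with an assumed weighted sum of three
    similarities: both captions plus the reference image.
    """
    # Stage 1: keep the top_k gallery images closest to the modification caption.
    candidates = sorted(
        gallery,
        key=lambda name: cosine(gallery[name], mod_caption_vec),
        reverse=True,
    )[:top_k]

    # Stage 2: re-rank candidates using all three signals (weights are assumed).
    w_mod, w_ctx, w_ref = weights

    def combined_score(name):
        vec = gallery[name]
        return (w_mod * cosine(vec, mod_caption_vec)
                + w_ctx * cosine(vec, ctx_caption_vec)
                + w_ref * cosine(vec, ref_image_vec))

    return sorted(candidates, key=combined_score, reverse=True)


# Toy 3-dimensional embeddings, made up for the example.
gallery = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.4, 0.0], "c": [0.0, 1.0, 0.0]}
ranking = two_stage_retrieve(
    gallery,
    mod_caption_vec=[1.0, 0.0, 0.0],
    ctx_caption_vec=[0.0, 1.0, 0.0],
    ref_image_vec=[0.0, 1.0, 0.0],
)
print(ranking)
```

In this toy case, image "a" matches the modification best, but "b" moves ahead in stage 2 because it also agrees with the context caption and the reference image. That is the kind of balance between textual modification and visual context the summary describes.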
Findings
Achieves up to 6.24% improvement in Recall@10 on FashionIQ.
Achieves up to 8.58% improvement in Recall@1 on CIRR.
Outperforms existing training-free methods in zero-shot CIR tasks.
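For readers unfamiliar with the metric behind these numbers, Recall@K is the percentage of queries whose target image appears among the top K retrieved results. The sketch below assumes one target per query; the helper name `recall_at_k` and the toy rankings are hypothetical and do not come from the paper or its benchmarks.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target is within the top-k ranked results.

    ranked_ids: list of ranked gallery id lists, one per query.
    target_ids: list of ground-truth target ids, one per query.
    """
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Two toy queries, both with target "b".
rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "b"]

r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
r3 = recall_at_k(rankings, targets, k=3)
print(r1, r2, r3)
```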
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a composed query consisting of a reference image and a modification text. Among various CIR approaches, training-free zero-shot methods based on pre-trained models are cost-effective but still face notable limitations. For example, sequential VLM-LLM pipelines process each modality independently, which often results in information loss and limits cross-modal interaction. In contrast, methods based on multimodal large language models (MLLMs) often focus exclusively on applying changes indicated by the text, without fully utilizing the contextual visual information from the reference image. To address these issues, we propose multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), a training-free zero-shot CIR framework. MCoT-RE utilizes multi-faceted Chain-of-Thought to guide the MLLM to…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
