CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

Zelong Sun; Dong Jing; Zhiwu Lu

arXiv:2502.20826·cs.CV·March 3, 2025

TL;DR

CoTMR introduces a training-free, multi-scale reasoning framework using large vision-language models and step-by-step inference to improve zero-shot composed image retrieval, outperforming previous methods and enhancing interpretability.

Contribution

The paper proposes CoTMR, a training-free approach that applies chain-of-thought and multi-scale reasoning with large vision-language models (LVLMs) to zero-shot composed image retrieval (ZS-CIR), addressing limitations of existing caption-based methods.

Findings

01. Significantly outperforms previous methods on four benchmarks.

02. Provides a more interpretable and reliable reasoning process.

03. Demonstrates the effectiveness of multi-scale reasoning in image retrieval.

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined…
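To make the ZS-CIR setting concrete, the following is a minimal sketch of the final retrieval step only, not the authors' implementation. It assumes that a description of the target image has already been produced and embedded, and that candidate images are ranked by cosine similarity in a shared embedding space, as is common in CLIP-style retrieval; the abstract excerpt above does not specify the scoring. The function names `cosine` and `rank_candidates`, and the toy two-dimensional vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate image ids ordered from most to least similar to the query.

    query_embedding: embedding of the (assumed) target description.
    candidate_embeddings: dict mapping image id -> image embedding.
    """
    scored = [
        (cosine(query_embedding, emb), image_id)
        for image_id, emb in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Toy example with made-up 2-D embeddings (illustrative values only).
gallery = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [-1.0, 0.0]}
ranking = rank_candidates([1.0, 0.0], gallery)
```

In a real pipeline the embeddings would come from a pretrained vision-language encoder; here they are hard-coded so the ranking logic can be read in isolation.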

Taxonomy

Methods: Contrastive Language-Image Pre-training · Focus