MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

Xuri Ge; Chunhao Wang; Xindi Wang; Zheyun Qin; Zhumin Chen; Xin Xin

arXiv:2603.17360 · cs.CV · March 19, 2026

TL;DR

This paper introduces MCoT-MVS, a novel multi-level vision selection method guided by multi-modal reasoning, significantly improving composed image retrieval accuracy by effectively extracting and fusing visual and textual cues.

Contribution

The paper proposes a multi-modal chain-of-thought reasoning framework that enhances visual cue selection and fusion for improved composed image retrieval performance.

Findings

1. Achieves state-of-the-art results on CIRR and FashionIQ benchmarks.
2. Effectively extracts discriminative visual features guided by reasoning cues.
3. Outperforms existing methods consistently across multiple metrics.

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level…
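
To make the idea of text-guided visual selection concrete, here is a minimal sketch of attention pooling over patch features, where a text cue embedding acts as the query. It is not the paper's architecture: the abstract is truncated before the modules are fully described, so the function name `text_guided_patch_pooling`, the scaled dot-product scoring, the `temperature` parameter, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_patch_pooling(patches, cue, temperature=1.0):
    """Pool patch features with attention weights derived from a text cue.

    Hypothetical helper, not taken from the paper.
    patches: list of patch feature vectors from the reference image.
    cue: embedding of one reasoning text (for example the "retained" text).
    Returns (weights, pooled), where weights sum to 1 over the patches.
    """
    dim = len(cue)
    scale = math.sqrt(dim) * temperature
    # Score each patch by its scaled similarity to the text cue.
    weights = softmax([dot(p, cue) / scale for p in patches])
    # Weighted sum of patch features: patches matching the cue dominate.
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return weights, pooled


if __name__ == "__main__":
    # Toy data: three patches and a cue aligned with the first axis.
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    retained_cue = [1.0, 0.0]
    weights, pooled = text_guided_patch_pooling(patches, retained_cue)
    print("attention weights:", weights)
    print("pooled feature:", pooled)
```

In this toy setup the first and third patches share the cue's direction and receive equal, higher weights than the second patch. A cue built from the "removed" text could be used the same way to identify patches to suppress, though how the paper combines the retained, removed, and target-inferred cues is not stated in the visible part of the abstract.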

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques