Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Yuanmin Tang; Xiaoting Qin; Jue Zhang; Jing Yu; Gaopeng Gou; Gang Xiong; Qingwei Ling; Saravan Rajmohan; Dongmei Zhang; Qi Wu

arXiv:2412.11077 · cs.CV · December 23, 2024

TL;DR

This paper introduces OSrCIR, a training-free, one-stage reasoning method using multimodal large language models for zero-shot composed image retrieval, improving accuracy by retaining visual details and enhancing interpretability.

Contribution

The paper presents a novel one-stage, training-free Reflective Chain-of-Thought framework that outperforms existing two-stage methods in zero-shot composed image retrieval.

Findings

01. Achieves 1.80% to 6.44% performance improvements over existing methods.

02. Sets new state-of-the-art results in zero-shot composed image retrieval.

03. Enhances interpretability through reflective reasoning that aligns the manipulation intent with cues from the reference image.

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage…
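The contrast the abstract draws between two-stage and one-stage pipelines can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `captioner`, `llm`, `mllm`, and `encode_text` are hypothetical callables standing in for real models, and ranking a gallery of precomputed image embeddings by cosine similarity is an assumed retrieval step that the abstract does not spell out.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_description(image, modification: str,
                          captioner: Callable, llm: Callable) -> str:
    """Two-stage baseline: caption the image first, then reason over text only.

    Any visual detail the caption omits is unavailable to the second stage.
    `captioner` and `llm` are hypothetical stand-ins.
    """
    caption = captioner(image)
    return llm(caption, modification)


def one_stage_description(image, modification: str, mllm: Callable) -> str:
    """One-stage variant: a multimodal model sees the image and the text together.

    `mllm` is a hypothetical stand-in for a multimodal large language model.
    """
    return mllm(image, modification)


def retrieve(description: str, gallery: Dict[str, Vector],
             encode_text: Callable[[str], Vector], top_k: int = 1) -> List[str]:
    """Rank gallery image names by similarity to the encoded target description.

    `gallery` maps image names to precomputed embeddings (an assumption here).
    """
    query = encode_text(description)
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]),
                    reverse=True)
    return ranked[:top_k]
```

Both description functions return a target description that feeds the same `retrieve` step, so in this sketch the only difference between the pipelines is whether the reasoning model has access to the reference image itself or only to its caption.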

Code & Models

Repositories

Pter61/osrcir
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications