Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

Rong-Cheng Tu; Wenhao Sun; Hanzhe You; Yingjie Wang; Jiaxing Huang; Li Shen; Dacheng Tao

arXiv:2505.19952·cs.CV·May 27, 2025

Open Access

TL;DR

This paper introduces a Multimodal Reasoning Agent (MRA) for zero-shot composed image retrieval. The agent constructs training triplets directly, without relying on intermediate textual representations, and the paper reports gains on FashionIQ, CIRR, and CIRCO.

Contribution

The framework removes the reliance on intermediate text by learning directly from synthetic triplets, which improves zero-shot composed image retrieval accuracy.

Findings

01 Improves R@10 by 7.5% on FashionIQ

02 Boosts R@1 by 9.6% on CIRR

03 Increases mAP@5 by 9.5% on CIRCO
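
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how Recall@K and mAP@K are commonly computed for retrieval. It is not taken from the paper: the function names, the toy image IDs, and the choice to normalize AP@K by min(K, number of relevant images) are assumptions made for illustration.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the single target image is among the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k for one query that may have several relevant images.

    Assumption: the sum of precisions at relevant ranks is divided by
    min(k, number of relevant images).
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_over_queries(scores):
    """Average a per-query metric over all queries."""
    return sum(scores) / len(scores)


# Hypothetical toy rankings: query id -> gallery image ids, best match first.
rankings = {"q1": ["a", "b", "c", "d", "e"], "q2": ["x", "y", "z", "w", "v"]}
targets = {"q1": "c", "q2": "u"}  # "u" is never retrieved for q2

recall_at_3 = mean_over_queries(
    [recall_at_k(rankings[q], targets[q], 3) for q in rankings]
)
ap_at_5 = average_precision_at_k(rankings["q1"], ["a", "c"], 5)
```

In this toy setup, `recall_at_3` averages one hit and one miss, and `ap_at_5` rewards the two relevant images found at ranks 1 and 3.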

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often use large language models (LLMs) to generate a synthetic target text that serves as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately to align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation: inaccuracies in the query-to-text and text-to-image mappings accumulate and ultimately degrade retrieval performance. To address these problems, we propose a novel framework that employs a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual…
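
The abstract is truncated before it describes the training objective, so the following is only a sketch of what aligning a compositional query directly with its target image could look like under a standard contrastive (InfoNCE) loss. The loss form, the temperature value, the function names, and the toy embeddings are all assumptions, not the authors' stated method.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(query_embs, target_embs, temperature=0.07):
    """Mean query-to-target InfoNCE loss over a batch.

    Assumption: query_embs[i] is the embedding of a (reference image, modifying
    text) query and target_embs[i] is its positive target image; every other
    target in the batch acts as a negative.
    """
    queries = [l2_normalize(q) for q in query_embs]
    targets = [l2_normalize(t) for t in target_embs]
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, target)) / temperature
            for target in targets
        ]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(queries)


# Hypothetical 2-D embeddings: aligned pairs versus deliberately swapped pairs.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(query_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss(query_embs, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors the aligned batch yields a loss near zero and the swapped batch a much larger one, which is the behavior a direct query-to-image objective would be optimizing.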

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: ALIGN