XR: Cross-Modal Agents for Composed Image Retrieval

Zhongyu Yang; Wei Pang; Yingfang Yuan

arXiv:2601.14245 · cs.IR · March 2, 2026

Open Access

TL;DR

XR is a training-free, multi-agent framework for composed image retrieval. It coordinates imagination (generative), similarity (matching), and question (reasoning) agents to improve the semantic and visual accuracy of retrieved results.

Contribution

The paper proposes a multi-agent, training-free approach to composed image retrieval that adds semantic understanding and reasoning beyond what existing embedding-based methods provide.

Findings

01. Achieves up to 38% improvement over baselines.

02. Each agent component is essential for performance.

03. Effective across multiple datasets: FashionIQ, CIRR, CIRCO.

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for…
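
The three-stage flow described in the abstract (imagine a target, filter coarsely, then verify) can be illustrated with a toy sketch. This is not the authors' implementation: images are replaced by sets of descriptive tags, the "imagination" step is a simple set edit rather than cross-modal generation, the hybrid score is a Jaccard mix with an arbitrary weight `alpha`, and the "question" step is a tag check rather than model-based reasoning. All names (`Query`, `imagine`, `coarse_filter`, `verify`, `retrieve`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Toy composed query: a reference 'image' (tag set) plus a modification."""
    reference: frozenset
    add: frozenset      # tags the modification text asks for
    remove: frozenset   # tags the modification text asks to drop


def jaccard(a, b):
    """Set overlap in [0, 1]; stands in for an embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def imagine(query):
    """Assumed stand-in for an imagination agent: build a proxy target."""
    return (query.reference - query.remove) | query.add


def coarse_filter(query, gallery, k, alpha=0.7):
    """Assumed stand-in for a similarity agent: hybrid score, keep top k."""
    target = imagine(query)
    scored = {
        name: alpha * jaccard(tags, target)
        + (1 - alpha) * jaccard(tags, query.reference)
        for name, tags in gallery.items()
    }
    top = sorted(scored, key=scored.get, reverse=True)[:k]
    return [(name, scored[name]) for name in top]


def verify(query, tags):
    """Assumed stand-in for a question agent: fraction of checks that pass."""
    checks = [t in tags for t in query.add] + [t not in tags for t in query.remove]
    return sum(checks) / len(checks) if checks else 1.0


def retrieve(query, gallery, k=3):
    """Coarse filter first, then re-rank the shortlist by verification."""
    shortlist = coarse_filter(query, gallery, k)
    return sorted(
        shortlist,
        key=lambda item: (verify(query, gallery[item[0]]), item[1]),
        reverse=True,
    )


# Made-up gallery and query, for illustration only.
gallery = {
    "red_long": frozenset({"dress", "red", "long-sleeve"}),
    "blue_long": frozenset({"dress", "blue", "long-sleeve"}),
    "blue_sleeveless": frozenset({"dress", "blue", "sleeveless"}),
    "green_shirt": frozenset({"shirt", "green", "short-sleeve"}),
}
query = Query(
    reference=gallery["red_long"],
    add=frozenset({"blue"}),
    remove=frozenset({"red"}),
)
ranked = retrieve(query, gallery, k=3)
print([name for name, _ in ranked])
```

In this toy run the coarse stage ranks the reference image itself (`red_long`) second because it shares most tags with the imagined target, and the verification stage then pushes it below `blue_sleeveless`, which satisfies both parts of the modification. That is the kind of correction a verification step is meant to provide over similarity alone.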

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques