XR: Cross-Modal Agents for Composed Image Retrieval
Zhongyu Yang, Wei Pang, Yingfang Yuan

TL;DR
XR is a multi-agent, training-free framework for composed image retrieval that combines generative (imagination), matching (similarity), and reasoning (question) agents to improve the semantic and visual accuracy of retrieved results.
Contribution
The paper proposes a multi-agent, training-free approach to composed image retrieval that adds semantic understanding and reasoning beyond what existing embedding-based methods capture.
Findings
XR achieves up to a 38% improvement over baselines.
Each agent component is essential for performance.
XR is effective across three datasets: FashionIQ, CIRR, and CIRCO.
Abstract
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for…
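The three-stage flow described in the abstract (imagine a target, filter coarsely by similarity, then verify with targeted questions) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Candidate`, `imagine_target`, `coarse_filter`, `verify`, `retrieve`) is hypothetical, captions stand in for images, and simple token overlap stands in for the generative, matching, and reasoning models a real system would call.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    caption: str  # assumption: a caption stands in for the image content


def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def imagine_target(reference_caption, modification):
    """Imagination stage (stub): describe the hypothetical target.
    A real agent would call a generative model here."""
    return f"{reference_caption} {modification}"


def similarity(a, b):
    """Similarity stage (stub): Jaccard overlap of token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def coarse_filter(target, gallery, k):
    """Keep the k gallery items most similar to the imagined target."""
    ranked = sorted(gallery, key=lambda c: similarity(target, c.caption),
                    reverse=True)
    return ranked[:k]


def verify(candidate, required_terms):
    """Question stage (stub): fraction of required terms the candidate
    satisfies. A real agent would ask a model targeted questions."""
    if not required_terms:
        return 1.0
    caption_tokens = tokens(candidate.caption)
    hits = sum(term in caption_tokens for term in required_terms)
    return hits / len(required_terms)


def retrieve(reference_caption, modification, required_terms, gallery, k=3):
    """Run the three stages in order and return the re-ranked shortlist."""
    target = imagine_target(reference_caption, modification)
    shortlist = coarse_filter(target, gallery, k)
    # sorted() is stable, so ties keep their coarse-filter order.
    return sorted(shortlist, key=lambda c: verify(c, required_terms),
                  reverse=True)


gallery = [
    Candidate("a", "red dress with long sleeves"),
    Candidate("b", "blue dress with short sleeves"),
    Candidate("c", "blue dress with long sleeves"),
    Candidate("d", "green hat"),
]
results = retrieve("red dress with long sleeves", "blue",
                   ["blue", "long"], gallery)
print([c.image_id for c in results])
```

In this toy gallery the coarse filter alone ties the unmodified reference-like item `a` with the correct item `c`, and the verification pass is what moves `c` to the top. That is only meant to illustrate why a reasoning stage after similarity matching can help; it says nothing about XR's actual models or scores.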
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
