WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang; Leigang Qu; Tianyu Yang; Xiangzhao Hao; Yifan Xu; Haiyun Guo; Jinqiao Wang

arXiv:2602.23029 · cs.CV · March 25, 2026

Open Access

TL;DR

WISER is a training-free framework for zero-shot composed image retrieval that combines wider search, adaptive fusion, and iterative refinement to effectively handle diverse query intents and outperform existing methods.

Contribution

WISER introduces a unified retrieval framework that combines Text-to-Image (T2I) and Image-to-Image (I2I) retrieval with intent-aware and uncertainty-aware modeling, enabling zero-shot composed image retrieval without any training.

Findings

01. Achieves 45% relative improvement on CIRCO (mAP@5).

02. Achieves 57% relative improvement on CIRR (Recall@1).

03. Outperforms many training-dependent methods.
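
As a reading aid for the percentages above, "relative improvement" is conventionally the gain divided by the baseline score. The sketch below assumes that convention. The function name and the numbers are hypothetical illustrations and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative gain of `new` over `baseline` as a fraction.

    Assumes the usual definition (new - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical scores chosen only to illustrate the arithmetic;
# they are NOT results reported for WISER or any baseline.
example_gain = relative_improvement(baseline=20.0, new=29.0)
print(f"{example_gain:.0%}")
```

Under this definition, a relative gain says nothing about the absolute scores involved, so the same percentage can correspond to very different absolute margins.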

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images…
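
The Wider Search step described in the abstract can be pictured as running both retrieval paths and pooling their candidates. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes the edited caption, the edited image, and the gallery images have already been embedded into one shared vector space (for example by a CLIP-style encoder). The function names, the plain union used for pooling, and the toy vectors are all hypothetical. The verify and refine stages are not sketched because the abstract is truncated before describing them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, gallery, k):
    """Indices of the k gallery embeddings most similar to the query."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(gallery)]
    # Sort by descending similarity; break ties by gallery index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def wider_search(caption_emb, edited_image_emb, gallery, k):
    """Hypothetical pooling step: retrieve via both paths, then take the union.

    caption_emb:      embedding of the edited caption (T2I path)
    edited_image_emb: embedding of the edited image (I2I path)
    """
    t2i = top_k(caption_emb, gallery, k)
    i2i = top_k(edited_image_emb, gallery, k)
    # Order-preserving union: T2I candidates first, then new I2I candidates.
    pool = list(dict.fromkeys(t2i + i2i))
    return {"t2i": t2i, "i2i": i2i, "pool": pool}


# Toy 2-D embeddings, made up for illustration only.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
result = wider_search([1.0, 0.1], [0.1, 1.0], gallery, k=2)
print(result)
```

In this toy run the two paths agree on one candidate and each contributes one the other misses, which is the complementary behavior the abstract attributes to T2I and I2I. How the pooled candidates are then scored and fused is left open here.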

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques