Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration

Yongcong Ye; Kai Zhang; Yanghai Zhang; Enhong Chen; Longfei Li; Jun Zhou

arXiv:2601.14060·cs.CV·January 21, 2026

TL;DR

This paper introduces CVSI, a zero-shot composed image retrieval (ZS-CIR) method that captures fine-grained visual and semantic modifications by integrating complementary visual and textual information. The authors report that it outperforms existing methods on the CIRR, CIRCO, and FashionIQ datasets.

Contribution

CVSI is the first approach to combine visual and semantic information extraction with complementary retrieval for zero-shot composed image retrieval.

Findings

01. CVSI significantly outperforms state-of-the-art methods on the CIRR, CIRCO, and FashionIQ datasets.

02. The method effectively captures fine-grained modifications in zero-shot scenarios.

03. Extensive experiments validate the robustness and superiority of CVSI.

Abstract

Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information…
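To make the task setup in the abstract concrete, the following is a minimal sketch of composed image retrieval in general: a query built from a reference image and a relative caption is scored against a gallery of candidate images. It is not CVSI, whose components are only partly visible in the truncated abstract. The simple weighted-sum fusion, the function names (`compose_query`, `retrieve`), the `alpha` weight, and the three-dimensional toy embeddings are all assumptions made for illustration; a real system would obtain embeddings from a pretrained vision-language encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose_query(ref_image_emb, caption_emb, alpha=0.5):
    """Hypothetical fusion: weighted sum of image and caption embeddings.

    This is a generic baseline chosen for illustration, not the
    integration strategy proposed in the paper.
    """
    img = normalize(ref_image_emb)
    txt = normalize(caption_emb)
    mixed = [alpha * i + (1 - alpha) * t for i, t in zip(img, txt)]
    return normalize(mixed)


def retrieve(query_emb, gallery, top_k=3):
    """Rank gallery images by cosine similarity to the composed query."""
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy embeddings (made up): axis 0 ~ "red dress", axis 1 ~ "long sleeves".
ref_image_emb = [1.0, 0.0, 0.0]   # reference image: a red dress
caption_emb = [0.0, 1.0, 0.0]     # relative caption: "make it long-sleeved"

gallery = {
    "same_dress": [1.0, 0.0, 0.0],
    "long_sleeve_red_dress": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose_query(ref_image_emb, caption_emb)
ranking = retrieve(query, gallery, top_k=3)
print(ranking)
```

In this toy setup the candidate that matches both the reference image and the requested modification ranks first, which is the behavior a composed retrieval system aims for. The fine-grained and complementary information the abstract says existing methods miss is exactly what such a coarse fusion cannot represent.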

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques