Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang

TL;DR
This paper introduces KEDs, a knowledge-enhanced dual-stream framework for zero-shot composed image retrieval that models detailed attributes and aligns visual and textual semantics, outperforming previous methods.
Contribution
The paper proposes a novel dual-stream framework that incorporates a database for attribute modeling and aligns pseudo-word tokens with textual concepts in a zero-shot setting.
Findings
KEDs outperforms previous zero-shot CIR methods on multiple benchmarks.
The framework effectively models detailed attributes like color and layout.
Explicit alignment of visual tokens with text improves retrieval accuracy.
Abstract
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object number and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover,…
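The abstract describes two steps: looking up relevant images and captions in a database, and projecting the enriched image feature into the text embedding space as a pseudo-word token. The toy sketch below illustrates only that general idea. It is not the authors' implementation: the function names (top_k_neighbors, enrich, project), the cosine-similarity lookup, the simple averaging with a fixed weight, and the hand-written projection matrix are all assumptions made for illustration. In practice the features would come from a pretrained vision-language encoder and the projection would be learned.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_neighbors(query, database, k):
    """Return the k database entries whose image feature is closest to the query."""
    ranked = sorted(database, key=lambda e: cosine(query, e["image"]), reverse=True)
    return ranked[:k]

def enrich(query, neighbors, weight=0.5):
    """Blend the query with the mean of the neighbors' image and caption features.

    The averaging rule and the fixed weight are hypothetical stand-ins for
    whatever fusion a real system would learn.
    """
    dim = len(query)
    mean = [0.0] * dim
    for entry in neighbors:
        for i in range(dim):
            mean[i] += (entry["image"][i] + entry["caption"][i]) / (2 * len(neighbors))
    return [(1 - weight) * q + weight * m for q, m in zip(query, mean)]

def project(feature, matrix):
    """Linear map standing in for a learned projection into text token space."""
    return [sum(w * x for w, x in zip(row, feature)) for row in matrix]

# Toy database of (image feature, caption feature) pairs; values are made up.
database = [
    {"image": [1.0, 0.0, 0.0], "caption": [0.9, 0.1, 0.0]},
    {"image": [0.0, 1.0, 0.0], "caption": [0.0, 0.9, 0.1]},
    {"image": [0.8, 0.2, 0.0], "caption": [0.7, 0.3, 0.0]},
]
query = [1.0, 0.1, 0.0]                  # reference image feature
projection = [[1.0, 0.0, 0.0],           # hypothetical 3-d to 2-d projection
              [0.0, 1.0, 0.0]]

neighbors = top_k_neighbors(query, database, k=2)
pseudo_token = project(enrich(query, neighbors), projection)
print(pseudo_token)
```

The resulting pseudo-word token would then be placed into a text prompt alongside the modification description, so that the text encoder produces a single query embedding for retrieval.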
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
