Sentence-level Prompts Benefit Composed Image Retrieval
Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, Chun-Mei Feng

TL;DR
This paper introduces an approach to composed image retrieval that learns sentence-level prompts with pretrained vision-language models. It outperforms existing methods, especially when the target image involves complex modifications of the reference image.
Contribution
It proposes learning sentence-level prompts instead of pseudo-words for improved composed image retrieval, leveraging pretrained models and novel loss functions.
Findings
Outperforms state-of-the-art on Fashion-IQ and CIRR datasets.
Effective in handling complex image modifications.
Uses sentence-level prompts with contrastive and alignment losses.
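To make the contrastive objective in the last finding concrete, here is a minimal sketch of a generic InfoNCE-style loss between composed-query embeddings and target-image embeddings. It is not the paper's exact formulation: the function names, the temperature value of 0.07, and the toy 2-D embeddings are assumptions chosen for illustration, and the alignment loss is not shown because this summary does not specify it.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors (assumes non-zero norms).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(queries, targets, temperature=0.07):
    # Mean InfoNCE loss over a batch: queries[i] should match targets[i],
    # and every other target in the batch acts as a negative.
    # The temperature is an illustrative assumption, not a value from the paper.
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

# Toy embeddings (hypothetical): each query is close to its own target.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(queries, aligned)
loss_swapped = contrastive_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query embedding is most similar to its own target and grows when the pairing is wrong, which is the behavior a retrieval model trained this way relies on.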
Abstract
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. Besides, several approaches have also been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when the target image encompasses complex changes to the reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo-word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to…
Peer Reviews
Decision: ICLR 2024 spotlight
- The proposed method is technically sound and simple.
- The paper is easy to follow, and the experiments are thorough, with ablations (such as prompt length and the weight in the loss function).
- Provides SOTA results against 10+ baselines on two public datasets.
- To be open-sourced.

- Limited novelty: A recent paper (https://arxiv.org/pdf/2310.09291.pdf) has quite similar methodology and motivations, except for a nuanced difference: training-free vs. learned sentence-level prompts. These methods need to be contextualized together, ideally under the same evaluation framework, so that we understand the value of the learned sentence-level prompts proposed by this paper.
- Experimental setup: CIRR dataset experiments use a random split of the training dataset as the test

1. Comprehensive literature survey and good motivation of the problem.
2. Sound approach based on innovative loss functions.
3. Good results that exceed the state of the art.

1. The overall innovation could be seen as modest. However, I am open to being convinced otherwise.

1. It is reasonable to generate sentence-level prompts depending on both the reference image and the relative caption to enrich expressivity.
2. Moreover, the experiments are solid, since the authors conduct a thorough comparison with previous methods.
3. The paper is well-written and the idea is easy to follow.

1. In my understanding, the sentence-level prompts are actually latent vectors output from the MLP layer, so it is hard to make sure the prompts work as expected, as demonstrated in Figure 1(c), i.e., decoupling the multiple objects and attributes of the query image and correctly integrating the process of object removal or attribute modification.
2. It is difficult to understand the pi’ in the prompt alignment loss. Does each reference image have an auxiliary text prompt? As a result, the Figure 2(a
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
