Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Huaying Zhang; Rintaro Yanagi; Ren Togo; Takahiro Ogawa; Miki Haseyama

arXiv:2406.18836·cs.CV·June 28, 2024

TL;DR

This paper introduces a zero-shot composed image retrieval (CIR) method that is trained on masked image-text pairs to learn query-target relationships, aiming to improve retrieval accuracy without CIR-specific annotated training data.

Contribution

The paper presents an end-to-end training approach for zero-shot CIR using masked image-text pairs, explicitly modeling query-target relationships for improved retrieval.

Findings

01. Effective zero-shot CIR achieved with masked image-text pairs

02. End-to-end training improves query-target relationship modeling

03. Experimental results demonstrate superior retrieval performance

Abstract

This paper proposes a novel zero-shot composed image retrieval (CIR) method that accounts for the query-target relationship by using masked image-text pairs. The objective of CIR is to retrieve a target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word, compose that pseudo word with the query text, and perform retrieval with a pre-trained visual-language model. However, they do not consider the query-target relationship when training the textual inversion network, so the network is not trained to acquire information useful for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting abundant, easily obtained image-text pairs together with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused…
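The abstract is truncated before it describes the masking strategy, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a training example can be built from one image-text pair by masking a rectangular region of the image (the masked image acts as the query image, the caption as the query text, and the original image as the target), and that retrieval ranks gallery embeddings by cosine similarity to a composed query embedding. The helper names `mask_region`, `make_training_triplet`, and `rank_gallery`, the rectangular mask, and the toy embeddings are all hypothetical.

```python
import math


def mask_region(image, top, left, height, width, fill=0.0):
    """Return a copy of a 2-D image (list of rows) with one rectangle set to `fill`."""
    masked = [row[:] for row in image]  # copy rows so the original stays intact
    for r in range(top, min(top + height, len(image))):
        for c in range(left, min(left + width, len(image[r]))):
            masked[r][c] = fill
    return masked


def make_training_triplet(image, caption, box):
    """Hypothetical helper: build (query image, query text, target image).

    Assumption: the masked image plays the role of the query image, the
    caption plays the role of the query text, and the unmasked image is
    the retrieval target.
    """
    top, left, height, width = box
    return mask_region(image, top, left, height, width), caption, image


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(query_embedding, gallery_embeddings):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query_embedding, g) for g in gallery_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: a 3x3 "image" and an illustrative caption.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
query_image, query_text, target_image = make_training_triplet(
    image, "a red car", (0, 0, 2, 2)
)

# Toy 2-D embeddings standing in for the outputs of a vision-language model.
ranking = rank_gallery([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
```

In an actual system the composed query embedding would come from the trained model rather than being supplied by hand; the sketch only illustrates how a single image-text pair can stand in for a query-target example and how a composed query is matched against a gallery.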

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications