Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Young Kyun Jang; Dat Huynh; Ashish Shah; Wen-Kai Chen; Ser-Nam Lim

arXiv:2405.00571 · cs.CV · May 2, 2024 · 1 cites

TL;DR

This paper introduces a novel zero-shot composed image retrieval method using spherical linear interpolation and text-anchored tuning, achieving state-of-the-art results without relying on expensive annotated datasets.

Contribution

The paper proposes a new zero-shot CIR approach that merges image and text representations directly with Slerp and employs Text-Anchored Tuning (TAT) to improve modality alignment.
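To make the Slerp step concrete, here is a minimal sketch of the standard spherical linear interpolation formula applied to two embedding vectors. It is not the authors' implementation: the function names `normalize` and `slerp`, the weight `alpha`, and the toy 2-D vectors are assumptions for illustration, and the encoder that would produce real image and text embeddings is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(image_emb, text_emb, alpha):
    """Interpolate along the great-circle arc between two embeddings.

    alpha = 0 returns the (normalized) image embedding and alpha = 1 the
    (normalized) text embedding. `alpha` is a hypothetical balancing
    weight, not a value taken from the paper.
    """
    v = normalize(image_emb)
    w = normalize(text_emb)
    # Clamp the cosine so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(v, w))))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-6:
        # Nearly collinear vectors: the formula is numerically unstable,
        # so fall back to a normalized linear blend.
        return normalize([(1 - alpha) * a + alpha * b for a, b in zip(v, w)])
    c_v = math.sin((1 - alpha) * theta) / sin_theta
    c_w = math.sin(alpha * theta) / sin_theta
    return [c_v * a + c_w * b for a, b in zip(v, w)]


# Toy stand-ins for an image embedding and a text embedding.
composed = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(composed)
```

Unlike a plain weighted average, the interpolated vector stays on the unit sphere, so it can be compared against normalized candidate image embeddings with cosine similarity without renormalizing.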

Findings

1. Achieves state-of-the-art performance on CIR benchmarks.

2. TAT improves the effectiveness of Slerp by reducing the modality gap.

3. The method is efficient and serves as a strong initial checkpoint for supervised models.

Abstract

Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications