Zero-shot Composed Text-Image Retrieval

Yikun Liu; Jiangchao Yao; Ya Zhang; Yanfeng Wang; Weidi Xie

arXiv:2306.07272 · cs.CV · March 7, 2024 · 1 citation

TL;DR

This paper introduces a scalable pipeline for constructing datasets for zero-shot composed image retrieval, and proposes a transformer-based model that effectively fuses multi-modal information to improve retrieval accuracy.

Contribution

It presents a novel data construction method from large-scale image-text datasets and a transformer-based adaptive fusion model for zero-shot CIR.

Findings

01. Achieves performance comparable or superior to state-of-the-art (SOTA) models on benchmarks.

02. Demonstrates the effectiveness of automatic dataset construction for zero-shot learning.

03. Validates the proposed model's ability to adaptively fuse multi-modal information.

Abstract

In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of core components in TransAgg; (iv) when evaluating on the publicly available benchmarks…
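
To make the composed-retrieval setting concrete, here is a minimal sketch in plain Python. It is not TransAgg: the fixed `weight_image` parameter stands in for the fusion weights that a learned model would predict, and the three-dimensional embeddings, the gallery names, and the helper functions `l2_normalize`, `fuse_query`, and `retrieve` are all hypothetical, chosen only to illustrate how a reference image and a modification text combine into a single query that ranks candidate images.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_query(image_emb, text_emb, weight_image=0.5):
    """Fuse a reference-image embedding and a text embedding.

    Assumption: a fixed convex combination. An adaptive model would
    predict these weights per query instead of using a constant.
    """
    weight_text = 1.0 - weight_image
    image_unit = l2_normalize(image_emb)
    text_unit = l2_normalize(text_emb)
    fused = [weight_image * i + weight_text * t
             for i, t in zip(image_unit, text_unit)]
    return l2_normalize(fused)


def retrieve(query, gallery, top_k=3):
    """Rank gallery images by cosine similarity to the fused query."""
    scores = {
        name: sum(q * g for q, g in zip(query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy embeddings (hypothetical): axis 0 ~ "dog", axis 1 ~ "beach", axis 2 ~ "cat".
reference_image = [1.0, 0.0, 0.0]    # a dog on grass
modification_text = [0.0, 1.0, 0.0]  # "on a beach"
gallery = {
    "dog_grass": [1.0, 0.0, 0.0],
    "dog_beach": [1.0, 1.0, 0.0],
    "cat_beach": [0.0, 1.0, 0.5],
}

query = fuse_query(reference_image, modification_text)
ranked = retrieve(query, gallery)
print(ranked)
```

In this toy example the fused query scores highest against the gallery image that shares content with both the reference image and the text, which is the behaviour a CIR model is trained to produce.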

Code & Models

Repositories

Code-kunkun/ZS-CIR
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques