Zero-shot Composed Text-Image Retrieval
Yikun Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR
This paper introduces a scalable pipeline for constructing datasets for zero-shot composed image retrieval, and proposes a transformer-based model that effectively fuses multi-modal information to improve retrieval accuracy.
Contribution
It presents a novel data construction method from large-scale image-text datasets and a transformer-based adaptive fusion model for zero-shot CIR.
Findings
Achieves performance comparable or superior to state-of-the-art models on publicly available benchmarks.
Demonstrates the effectiveness of automatic dataset construction for zero-shot learning.
Validates the proposed model's ability to adaptively fuse multi-modal information.
Abstract
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluating on the publicly available benchmarks…
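To make the composed-retrieval setting concrete, here is a minimal sketch of the general idea: a reference-image embedding and a modification-text embedding are fused into one query vector, and gallery images are ranked by cosine similarity to it. This is not the TransAgg architecture, whose details are not given here. The function names (`fuse`, `retrieve`), the softmax gate, the hand-set `gate_logits`, and the toy 3-dimensional embeddings are all assumptions made for illustration; in a trained model the fusion weights would be predicted rather than fixed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_emb, text_emb, gate_logits):
    """Combine two embeddings with softmax weights (hypothetical gate).

    gate_logits is a pair of scores, one per modality. Here they are
    supplied by hand; a learned model would predict them from the inputs.
    """
    w_img, w_txt = softmax(gate_logits)
    mixed = [w_img * i + w_txt * t for i, t in zip(image_emb, text_emb)]
    return l2_normalize(mixed)


def retrieve(query, gallery):
    """Return gallery names sorted by cosine similarity to the query."""
    scored = []
    for name, emb in gallery.items():
        cand = l2_normalize(emb)
        scored.append((sum(q * c for q, c in zip(query, cand)), name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy example: axis 0 stands for the reference-image content,
# axis 1 for the requested modification, axis 2 for unrelated content.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "unchanged": [1.0, 0.0, 0.0],
    "modified": [0.7, 0.7, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

query = fuse(reference_image, modification_text, gate_logits=[0.0, 0.0])
ranking = retrieve(query, gallery)
print(ranking)
```

With equal gate logits, both modalities contribute equally, so the candidate that shares content with both the reference image and the modification text ranks first. Shifting the logits toward one modality moves the query toward that modality's embedding.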
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
