Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, Xiaodan Liang

TL;DR
This paper introduces Product1M, a large-scale multi-modal dataset for weakly supervised instance-level product retrieval, and proposes CAPTURE, a transformer-based model that effectively leverages multi-modal data for fine-grained product identification.
Contribution
The paper presents a new large-scale dataset, Product1M, and a novel model, CAPTURE, for weakly supervised multi-modal instance-level product retrieval, addressing real-world complexities.
Findings
CAPTURE outperforms state-of-the-art baselines.
Product1M contains over 1 million image-caption pairs.
Extensive ablations confirm model effectiveness.
Abstract
Nowadays, customers' demands for E-commerce are more diversified, which introduces more complications to the product retrieval industry. Previous methods are either restricted to single-modal input or perform supervised image-level product retrieval, and thus fail to accommodate real-life scenarios where enormous amounts of weakly annotated multi-modal data are present. In this paper, we investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval among fine-grained product categories. To promote the study of this challenging task, we contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. Notably, Product1M contains over 1 million image-caption pairs and consists of two sample types, i.e., single-product and multi-product samples, which encompass a wide variety of cosmetics brands. In…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Residual Connection · Softmax · Dropout · Adam
