Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval
Xiao Dong, Xunlin Zhan, Yunchao Wei, Xiaoyong Wei, Yaowei Wang, Minlong Lu, Xiaochun Cao, Xiaodan Liang

TL;DR
This paper introduces a novel entity-graph enhanced cross-modal pretraining model for fine-grained, instance-level product retrieval, leveraging multi-modal data and entity knowledge to improve accuracy in realistic weakly-supervised environments.
Contribution
It proposes the EGE-CMP model that explicitly integrates entity graph information into cross-modal pretraining for improved product retrieval accuracy.
Findings
EGE-CMP outperforms state-of-the-art cross-modal baselines such as CLIP, UNITER, and CAPTURE.
The model effectively reduces confusion between different objects.
Experimental results demonstrate the model's generalizability and efficacy.
Abstract
Our goal in this research is to study a more realistic environment in which we can conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M datasets and define two practical instance-level retrieval tasks that enable evaluation on price comparison and personalized recommendation. For both instance-level tasks, accurately pinpointing the product target mentioned in the visual-linguistic data and effectively decreasing the influence of irrelevant content is quite challenging. To address this, we train a more effective cross-modal pretraining model that can adaptively incorporate key concept information from the multi-modal data, using an entity graph whose nodes and edges respectively denote the entities and the similarity relations between entities. Specifically, a…
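The entity graph mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the toy embeddings, the cosine-similarity measure, the 0.8 threshold, and the function names `cosine` and `build_entity_graph` are all assumptions chosen for illustration. It only shows the general idea of a graph whose nodes are entities and whose edges record a similarity relation between them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.8):
    """Build an undirected graph as an adjacency dict.

    Nodes are entity names; an edge (weighted by similarity) joins two
    entities whose cosine similarity is at least `threshold`.
    The threshold is a hypothetical choice, not taken from the paper.
    """
    names = sorted(embeddings)
    graph = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


# Hypothetical toy entity embeddings (made up for this example).
toy_embeddings = {
    "lipstick": [0.9, 0.1, 0.0],
    "lip gloss": [0.8, 0.2, 0.1],
    "shampoo": [0.0, 0.1, 0.9],
}

graph = build_entity_graph(toy_embeddings, threshold=0.8)
for entity, neighbours in graph.items():
    print(entity, "->", sorted(neighbours))
```

With these toy vectors, only the two lip products end up connected, while the unrelated entity stays isolated. A real system would derive the entities and their representations from the product captions rather than from hand-written vectors.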
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Web Data Mining and Analysis
Methods: UNiversal Image-TExt Representation Learning · Contrastive Language-Image Pre-training
