Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Xiao Dong; Xunlin Zhan; Yunchao Wei; Xiaoyong Wei; Yaowei Wang; Minlong Lu; Xiaochun Cao; Xiaodan Liang

arXiv:2206.08842·cs.MM·June 20, 2022

TL;DR

This paper introduces a novel entity-graph enhanced cross-modal pretraining model for fine-grained, instance-level product retrieval, leveraging multi-modal data and entity knowledge to improve accuracy in realistic weakly-supervised environments.

Contribution

The paper proposes EGE-CMP (Entity-Graph Enhanced Cross-Modal Pretraining), a model that explicitly integrates entity-graph information into cross-modal pretraining to improve instance-level product retrieval accuracy.

Findings

01. EGE-CMP outperforms several state-of-the-art cross-modal baselines, including CLIP, UNITER, and CAPTURE.

02. Injecting entity knowledge helps the model avoid confusing different objects that appear in the same product data.

03. Experimental results support the model's efficacy and generalizability.

Abstract

Our goal in this research is to study a more realistic setting for weakly-supervised, multi-modal, instance-level product retrieval over fine-grained product categories. We first contribute the Product1M datasets and define two practical instance-level retrieval tasks that enable evaluation on price comparison and personalized recommendation. For both tasks, the main challenge is to accurately pinpoint the product target mentioned in the visual-linguistic data while reducing the influence of irrelevant content. To address this, we train a more effective cross-modal pretraining model that adaptively incorporates key concept information from the multi-modal data by using an entity graph, whose nodes denote entities and whose edges denote similarity relations between entities. Specifically, a…
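As a rough illustration of the entity graph described in the abstract (nodes are entities, edges are similarity relations), the following minimal sketch builds such a graph from toy entity vectors. It is not the authors' implementation: the function names, the use of cosine similarity, the threshold value, and the toy vectors are all assumptions made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.5):
    """Hypothetical helper: connect two entities when their similarity reaches the threshold.

    embeddings: dict mapping entity name -> vector (list of floats).
    Returns an undirected graph as a dict: entity -> {neighbor: similarity}.
    """
    names = list(embeddings)
    graph = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:  # assumed edge rule, not taken from the paper
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


# Toy, made-up entity vectors; a real system would use learned embeddings.
toy = {
    "lipstick": [1.0, 0.0, 0.0],
    "lip gloss": [0.9, 0.1, 0.0],
    "shampoo": [0.0, 1.0, 0.0],
}
graph = build_entity_graph(toy, threshold=0.5)
```

In this toy example only "lipstick" and "lip gloss" end up connected, while "shampoo" stays isolated, which mirrors the idea that edges link semantically similar entities.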

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Web Data Mining and Analysis

Methods: UNiversal Image-TExt Representation Learning · Contrastive Language-Image Pre-training