TL;DR
This paper introduces K3M, a multi-modal pretraining method for E-commerce product data that adds a knowledge modality to correct noisy image and text modalities and to supplement missing ones.
Contribution
K3M is a novel multi-modal pretraining approach that integrates a knowledge modality and models the interactions among modalities to improve robustness in E-commerce scenarios.
Findings
K3M outperforms baseline methods under modality-noise conditions.
K3M achieves significant improvements on real-world E-commerce datasets.
The method effectively handles missing and noisy modalities in product data.
Abstract
In this paper, we address multi-modal pretraining of product data in the field of E-commerce. Current multi-modal pretraining methods proposed for image and text modalities lack robustness in the face of modality-missing and modality-noise, which are two pervasive problems of multi-modal product data in real E-commerce scenarios. To this end, we propose a novel method, K3M, which introduces knowledge modality in multi-modal pretraining to correct the noise and supplement the missing of image and text modalities. The modal-encoding layer extracts the features of each modality. The modal-interaction layer is capable of effectively modeling the interaction of multiple modalities, where an initial-interactive feature fusion model is designed to maintain the independence of image modality and text modality, and a structure aggregation module is designed to fuse the information of image,…
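The idea of keeping each modality's independent feature next to its interactive feature can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the fusion formula, so the averaging-based `interact` function, the `weight` parameter, and all function names below are hypothetical stand-ins chosen only to show why retaining the initial feature helps when the other modality is missing.

```python
from typing import Dict, List, Optional

Vector = List[float]


def interact(query: Vector, context: Optional[Vector]) -> Vector:
    """Toy stand-in for cross-modal interaction (an assumption, not K3M's layer).

    Averages the query feature with the other modality's feature.
    If the other modality is missing, the query is returned unchanged.
    """
    if context is None:
        return list(query)
    return [(q + c) / 2.0 for q, c in zip(query, context)]


def fuse(initial: Vector, interactive: Vector, weight: float = 0.5) -> Vector:
    """Blend the independent (initial) feature with the interactive feature."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(initial, interactive)]


def initial_interactive_fusion(
    text: Optional[Vector],
    image: Optional[Vector],
    weight: float = 0.5,
) -> Dict[str, Vector]:
    """Return one fused feature per modality that is actually present."""
    fused: Dict[str, Vector] = {}
    if text is not None:
        fused["text"] = fuse(text, interact(text, image), weight)
    if image is not None:
        fused["image"] = fuse(image, interact(image, text), weight)
    return fused


# Both modalities present: each fused feature mixes in the other modality.
both = initial_interactive_fusion([1.0, 3.0], [3.0, 1.0])

# Image missing: the text feature passes through without being corrupted.
text_only = initial_interactive_fusion([1.0, 3.0], None)
```

In this sketch a missing image leaves the text feature untouched, which is the kind of robustness the abstract attributes to keeping the image and text modalities independent.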
Taxonomy
Methods: K3M
