Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation
Xinglong Wu, Anfeng Huang, Hongwei Yang, Hui He, Yu Tai, Weizhe Zhang

TL;DR
This paper introduces CLIPER, a multi-modal recommendation framework that leverages cross-modal alignment to bridge semantic gaps and improve recommendation accuracy by extracting fine-grained semantic information.
Contribution
The paper proposes a novel multi-view modality-alignment approach using CLIP to better capture cross-modal semantics for recommendation tasks.
Findings
CLIPER outperforms existing models on three public datasets.
The multi-view alignment improves semantic representation quality.
The approach effectively bridges the semantic gap across modalities.
Abstract
Multi-modal recommendation greatly enhances the performance of recommender systems by modeling the auxiliary information from multi-modality contents. Most existing multi-modal recommendation models primarily exploit multimedia information propagation processes to enrich item representations and directly utilize modal-specific embedding vectors independently obtained from upstream pre-trained models. However, this might be inappropriate since the abundant task-specific semantics remain unexplored, and the cross-modality semantic gap hinders the recommendation performance. Inspired by the recent progress of the cross-modal alignment model CLIP, in this paper, we propose a novel CLIP Enhanced Recommender (CLIPER) framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information. Specifically, we introduce…
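To make the idea of cross-modal alignment concrete, the following is a minimal sketch of the symmetric contrastive objective popularized by CLIP, applied to paired image and text embeddings of the same items. It is not CLIPER's implementation, since the abstract is truncated before the method details; the function names, the temperature value of 0.07, and the NumPy-only setup are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to describe the
    same item; every other row in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own text within its row.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each text should pick its own image within its column.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(8, 16))
    # Hypothetical "aligned" text embeddings: the image vectors plus small noise.
    aligned_text = image_emb + 0.01 * rng.normal(size=(8, 16))
    # Misaligned case: the same text vectors paired with the wrong items.
    shuffled_text = np.roll(aligned_text, shift=1, axis=0)
    print("aligned loss: ", clip_style_alignment_loss(image_emb, aligned_text))
    print("shuffled loss:", clip_style_alignment_loss(image_emb, shuffled_text))
```

Under these assumptions, embeddings that agree across modalities for the same item produce a much lower loss than mismatched pairs, which is the sense in which such an objective narrows a cross-modal semantic gap.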
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
