TL;DR
This paper introduces a novel multi-modal self-supervised learning method for recommendation systems, leveraging cross-modal dependencies and user preferences to improve multimedia recommendation accuracy.
Contribution
It proposes a new MMSSL framework that uses adversarial perturbations and contrastive learning to better model user preferences and cross-modal relationships in recommendation tasks.
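The contrastive component mentioned above can be illustrated with a generic objective. The following is a minimal sketch, assuming an InfoNCE-style loss between two modality views of the same users; the summary does not state the paper's exact objective, and the function name `info_nce`, the temperature of 0.2, and the toy "visual" and "textual" embeddings are all illustrative assumptions. The adversarial perturbation component is not covered here.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(view_a, view_b, temperature=0.2):
    """Generic InfoNCE loss between two aligned lists of embeddings.

    Row i of view_a and row i of view_b form the positive pair; every other
    row of view_b acts as a negative. This is an assumed, textbook objective,
    not necessarily the one used by MMSSL.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of a_i to every row of b, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy data: 8 hypothetical users with 16-dimensional embeddings per modality.
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
textual_aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in visual]
textual_random = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(visual, textual_aligned)
loss_random = info_nce(visual, textual_random)
print(loss_aligned < loss_random)
```

When the two views of a user agree, the loss is small; when they are unrelated, it approaches the log of the batch size. That gap is the self-supervision signal such an objective would provide without extra labels.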
Findings
Outperforms state-of-the-art baselines on real-world datasets.
Effectively captures inter-modal semantic commonality and user preference diversity.
Demonstrates robustness with limited labeled data.
Abstract
The online emergence of multi-modal sharing platforms (e.g., TikTok, YouTube) is driving personalized recommender systems to incorporate various modalities (e.g., visual, textual, and acoustic) into latent user representations. While existing work on multi-modal recommendation exploits multimedia content features to enhance item embeddings, its representation capability is limited by heavy label reliance and weak robustness on sparse user behavior data. Inspired by the recent progress of self-supervised learning in alleviating the label scarcity issue, we explore deriving self-supervision signals for effectively learning modality-aware user preferences and cross-modal dependencies. To this end, we propose a new Multi-Modal Self-Supervised Learning (MMSSL) method which tackles two key challenges. Specifically, to characterize the inter-dependency between the user-item…
Taxonomy
Methods: Contrastive Learning
