Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities
Yu Ye, Junchen Fu, Yu Song, Kaiwen Zheng, Joemon M. Jose

TL;DR
This empirical study evaluates whether multimodal embeddings improve recommendation. It finds that the text modality alone often suffices and that complex graph-based fusion methods gain more from multimodal embeddings than simple fusion models do.
Contribution
The paper provides a comprehensive empirical analysis of whether multimodal embeddings benefit recommendation, highlighting the effectiveness of sophisticated fusion and showing that the text modality alone is often sufficient.
Findings
Multimodal embeddings generally improve recommendation performance.
Text modality alone often matches full multimodal performance.
Simple fusion models show limited gains compared to complex graph-based methods.
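The "whole vs. individual modalities" comparison behind these findings can be illustrated with a toy ablation. The sketch below is not the paper's method or any of the evaluated models: it assumes a plain concatenation fusion, dot-product scoring, and hand-made two-dimensional embeddings, and all names (fuse, rank_items, items, user_vec) are hypothetical. A disabled modality is replaced with zeros so the fused vector keeps the same size.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(text_emb, visual_emb, use_text=True, use_visual=True):
    """Concatenation fusion (an assumed, deliberately simple scheme).

    A disabled modality is zeroed out rather than dropped, so the fused
    vector has the same dimensionality in every ablation setting.
    """
    t = l2_normalize(text_emb) if use_text else [0.0] * len(text_emb)
    v = l2_normalize(visual_emb) if use_visual else [0.0] * len(visual_emb)
    return t + v

def rank_items(user_vec, items, use_text=True, use_visual=True):
    """Rank item ids by dot product between the user vector and fused item vectors."""
    scores = {}
    for item_id, (text_emb, visual_emb) in items.items():
        fused = fuse(text_emb, visual_emb, use_text, use_visual)
        scores[item_id] = sum(u * f for u, f in zip(user_vec, fused))
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (made up for illustration): item id -> (text embedding, visual embedding).
items = {
    "a": ([1.0, 0.0], [0.0, 1.0]),
    "b": ([0.0, 1.0], [1.0, 0.0]),
    "c": ([1.0, 1.0], [1.0, 1.0]),
}
# Hypothetical user vector: first two dims align with text, last two with visual.
user_vec = [1.0, 0.0, 0.2, 0.0]

full = rank_items(user_vec, items)                            # both modalities
text_only = rank_items(user_vec, items, use_visual=False)     # text only
visual_only = rank_items(user_vec, items, use_text=False)     # visual only

print("full:       ", full)
print("text only:  ", text_only)
print("visual only:", visual_only)
```

With this toy data the text-only ranking happens to equal the full multimodal ranking, while the visual-only ranking differs. That outcome is built into the example and is not evidence for the paper's findings; a real study would compare metrics such as Recall or NDCG across trained models and datasets.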
Abstract
Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach is founded on the intuitive assumption that incorporating multimodal embeddings can enhance recommendation performance. However, despite its popularity, this assumption lacks comprehensive empirical verification. This presents a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation? To answer this question, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern multimodal recommendation models, both as a whole and individually. Specifically, we pose two key research questions: (1) Do multimodal embeddings as a whole improve…
Taxonomy
Topics: Multimodal Machine Learning Applications · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)
