Does Multimodality Improve Recommender Systems as Expected? A Critical Analysis and Future Directions
Hongyu Zhou, Yinan Zhang, Aixin Sun, Zhiqi Shen

TL;DR
This paper critically evaluates whether multimodal data actually benefits recommender systems. It proposes a framework for assessing multimodal recommenders and offers insights into when and how multimodality improves recommendations.
Contribution
The paper introduces a structured evaluation framework for multimodal recommenders and benchmarks a set of multimodal models against traditional baselines, showing that modality importance is task-specific and identifying which integration strategies are effective.
Findings
The benefits of multimodal data are most pronounced in sparse-interaction scenarios.
Text features excel in e-commerce, while visual features excel in short-video recommendation.
Ensemble-Based Learning outperforms Fusion-Based Learning.
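The distinction in the last finding can be illustrated with a minimal sketch. It assumes one common reading of the two terms: fusion-based learning merges modality features into a single item representation before scoring, while ensemble-based learning scores each modality separately and then combines the scores. The function names, the linear projection `w_fuse`, the dot-product scoring, and the equal weights are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

def fusion_scores(user_vec, text_emb, image_emb, w_fuse):
    """Fusion-style (assumed): concatenate modality features into one
    item representation, project it, then score against one user vector."""
    fused_items = np.concatenate([text_emb, image_emb], axis=1) @ w_fuse
    return fused_items @ user_vec

def ensemble_scores(user_text, user_image, text_emb, image_emb,
                    weights=(0.5, 0.5)):
    """Ensemble-style (assumed): score each modality independently,
    then combine the per-modality scores with fixed weights."""
    s_text = text_emb @ user_text
    s_image = image_emb @ user_image
    return weights[0] * s_text + weights[1] * s_image

# Toy data: 5 items with 8-dim text and 6-dim image features (random).
rng = np.random.default_rng(0)
n_items, d_text, d_image, d = 5, 8, 6, 4
text_emb = rng.normal(size=(n_items, d_text))
image_emb = rng.normal(size=(n_items, d_image))
w_fuse = rng.normal(size=(d_text + d_image, d))   # hypothetical projection
user_vec = rng.normal(size=d)                     # user in fused space
user_text = rng.normal(size=d_text)               # user in text space
user_image = rng.normal(size=d_image)             # user in image space

fused = fusion_scores(user_vec, text_emb, image_emb, w_fuse)
ens = ensemble_scores(user_text, user_image, text_emb, image_emb)
```

Both functions return one score per item, so their rankings can be compared directly. In a trained system the projection and the ensemble weights would be learned rather than fixed as they are here.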
Abstract
Multimodal recommendation systems are increasingly popular for their potential to improve performance by integrating diverse data types. However, the actual benefits of this integration remain unclear, raising questions about when and how it truly enhances recommendations. In this paper, we propose a structured evaluation framework to systematically assess multimodal recommendations across four dimensions: Comparative Efficiency, Recommendation Tasks, Recommendation Stages, and Multimodal Data Integration. We benchmark a set of reproducible multimodal models against strong traditional baselines and evaluate their performance on different platforms. Our findings show that multimodal data is particularly beneficial in sparse interaction scenarios and during the recall stage of recommendation pipelines. We also observe that the importance of each modality is task-specific, where text…
