Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation
Claudio Pomo, Matteo Attimonelli, Danilo Danese, Fedelucio Narducci, and Tommaso Di Noia

TL;DR
This paper investigates whether multimodal recommender systems truly leverage multimodal content or whether their improvements stem from increased model complexity, and proposes using Large Vision-Language Models (LVLMs) to generate semantically aligned embeddings that improve recommendation performance.
Contribution
The paper introduces the use of Large Vision-Language Models to generate semantically aligned multimodal embeddings without fusion, demonstrating improved recommendation accuracy and interpretability.
Findings
LVLMs produce semantically aligned multimodal embeddings.
Using LVLM embeddings improves recommendation performance.
Decoding the embeddings into textual descriptions helps in understanding and validating what they encode.
Abstract
Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While these systems are effective, it remains unclear whether their gains stem from true multimodal understanding or from increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-BERT) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple…
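To make the contrast in the abstract concrete, the sketch below compares the two ways an item feature matrix might be assembled. It is only an illustration under stated assumptions, not the authors' pipeline: the random arrays stand in for real extractor outputs, the function names `l2_normalize` and `fuse_concat` are hypothetical, concatenation is just one common ad hoc fusion choice, and the 768 and 4096 dimensions are assumed values (only the 2048-dimensional pooled ResNet50 feature size is standard).

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so that neither modality dominates."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def fuse_concat(visual, textual):
    """One ad hoc fusion strategy: normalize per modality, then concatenate."""
    return np.concatenate([l2_normalize(visual), l2_normalize(textual)], axis=1)


rng = np.random.default_rng(0)
n_items = 4

# Placeholders for modality-specific extractor outputs (assumed shapes).
visual = rng.normal(size=(n_items, 2048))   # stand-in for ResNet50 pooled features
textual = rng.normal(size=(n_items, 768))   # stand-in for sentence embeddings (assumed size)

# Pipeline A: two encoders plus an explicit fusion step.
fused = fuse_concat(visual, textual)        # shape: (n_items, 2048 + 768)

# Pipeline B: a single multimodal-by-design vector per item.
# The 4096 dimension is hypothetical; a real LVLM would produce these vectors
# from an image and a structured prompt.
lvlm_embeddings = rng.normal(size=(n_items, 4096))
item_features = lvlm_embeddings             # used directly, no fusion step
```

In pipeline A, nothing forces the visual and textual halves of `fused` to live in a shared semantic space, which is the alignment issue the abstract raises. In pipeline B the recommender consumes one vector per item, so the fusion step disappears altogether.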
