Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion
Hai-Dang Kieu, Min Xu, Thanh Trung Huynh, Dung D. Le

TL;DR
VIRAL introduces a novel multimodal recommendation framework that leverages fine-grained visual descriptions and an information-aware fusion strategy to improve item representations and recommendation accuracy.
Contribution
The paper presents VIRAL, a new framework combining vision-language models and information theory for better multimodal fusion in recommendations.
Findings
VIRAL outperforms existing multimodal recommendation baselines.
Visual feature contribution is significantly enhanced.
The framework effectively disentangles shared and unique modality information.
Abstract
Recent advances in multimodal recommendation (MMR) highlight the potential of integrating visual and textual content to enrich item representations. However, existing methods often rely on coarse visual features and naive fusion strategies, resulting in redundant or misaligned representations. From an information-theoretic perspective, effective fusion should balance unique, shared, and redundant modality information to preserve complementary cues. To this end, we propose VIRAL, a novel Vision-Language and Information-aware Recommendation framework that enhances multimodal fusion through two components: (i) a VLM-based visual enrichment module that generates fine-grained, title-guided descriptions for semantically aligned image representations, and (ii) an information-aware fusion module inspired by Partial Information Decomposition (PID) to disentangle and integrate complementary…
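As background for the PID-inspired fusion mentioned in the abstract, the standard two-source decomposition from the Partial Information Decomposition literature can be written as below. This is a general sketch, not necessarily the formulation VIRAL uses; the symbols X_v (visual features), X_t (textual features), and Y (the recommendation target) are assumed here for illustration.

```latex
% Standard two-source PID identities (general background, not taken from the paper).
% X_v, X_t, and Y are illustrative symbols assumed for this sketch.
\begin{align}
I(X_v, X_t; Y) &= R + U_v + U_t + S \\
I(X_v; Y)      &= R + U_v \\
I(X_t; Y)      &= R + U_t
\end{align}
```

Here R is the information about Y carried redundantly by both modalities, U_v and U_t are the parts unique to each modality, and S is the synergistic information available only when both are observed together.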
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
