Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion

Hai-Dang Kieu; Min Xu; Thanh Trung Huynh; Dung D. Le

arXiv:2511.02113 · cs.IR · November 11, 2025

Open Access

TL;DR

VIRAL introduces a novel multimodal recommendation framework that leverages fine-grained visual descriptions and an information-aware fusion strategy to improve item representations and recommendation accuracy.

Contribution

The paper presents VIRAL, a new framework combining vision-language models and information theory for better multimodal fusion in recommendations.

Findings

01. VIRAL outperforms existing multimodal recommendation baselines.

02. Visual feature contribution is significantly enhanced.

03. The framework effectively disentangles shared and unique modality information.

Abstract

Recent advances in multimodal recommendation (MMR) highlight the potential of integrating visual and textual content to enrich item representations. However, existing methods often rely on coarse visual features and naive fusion strategies, resulting in redundant or misaligned representations. From an information-theoretic perspective, effective fusion should balance unique, shared, and redundant modality information to preserve complementary cues. To this end, we propose VIRAL, a novel Vision-Language and Information-aware Recommendation framework that enhances multimodal fusion through two components: (i) a VLM-based visual enrichment module that generates fine-grained, title-guided descriptions for semantically aligned image representations, and (ii) an information-aware fusion module inspired by Partial Information Decomposition (PID) to disentangle and integrate complementary…
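To make the Partial Information Decomposition idea concrete, here is a minimal sketch for two discrete sources and one target. It is not VIRAL's fusion module, whose details are not given in this excerpt. It assumes the standard PID accounting, in which the joint information I(X1,X2;Y) splits into redundant, unique-to-X1, unique-to-X2, and synergistic parts, and it swaps in the simple minimum-mutual-information (MMI) redundancy measure, R = min(I(X1;Y), I(X2;Y)), purely for illustration. The function names `mutual_information` and `pid_mmi` are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Return I(A;B) in bits for a joint distribution given as {(a, b): p}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def pid_mmi(joint):
    """Decompose I(X1,X2;Y) for a joint distribution {(x1, x2, y): p}.

    Assumption: redundancy is the MMI measure min(I(X1;Y), I(X2;Y)),
    chosen here for simplicity; other redundancy measures give other splits.
    """
    j1, j2, j12 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, x2, y), p in joint.items():
        j1[(x1, y)] += p          # marginal over x2
        j2[(x2, y)] += p          # marginal over x1
        j12[((x1, x2), y)] += p   # both sources treated as one variable
    i1 = mutual_information(j1)
    i2 = mutual_information(j2)
    i12 = mutual_information(j12)
    redundant = min(i1, i2)
    unique1 = i1 - redundant
    unique2 = i2 - redundant
    # Whatever the joint information holds beyond the other three terms.
    synergy = i12 - redundant - unique1 - unique2
    return {
        "redundant": redundant,
        "unique1": unique1,
        "unique2": unique2,
        "synergy": synergy,
    }


# Toy case 1: Y = X1 XOR X2, so neither source alone says anything about Y.
xor = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
# Toy case 2: X1 = X2 = Y, so both sources carry the same bit.
copy = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

if __name__ == "__main__":
    print("xor :", pid_mmi(xor))
    print("copy:", pid_mmi(copy))
```

In the XOR case all of the information is synergistic, while in the copy case all of it is redundant. Read loosely, with the two sources standing in for image and text features, this is the distinction the abstract draws between naive fusion that duplicates content and fusion that preserves complementary cues.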

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques