CEMG: Collaborative-Enhanced Multimodal Generative Recommendation
Yuzhen Lin, Hongyi Chen, Xuanjing Chen, Shaowen Wang, Ivonne Xu, and Dongming Jiang

TL;DR
CEMG introduces a novel framework that enhances multimodal recommendation by dynamically integrating visual and textual features with collaborative signals, converting them into semantic codes, and generating recommendations using a fine-tuned language model.
Contribution
This work presents a new multimodal recommendation framework that effectively combines collaborative signals with visual and textual features through a unified, generative approach.
Findings
CEMG significantly outperforms existing baselines in recommendation accuracy.
The proposed fusion and tokenization methods improve item representation quality.
End-to-end training with a language model enhances recommendation diversity and relevance.
Abstract
Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly…
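To make the tokenization step more concrete, the sketch below illustrates only the residual quantization idea behind an RQ-VAE: a vector is mapped to a short tuple of discrete codes by repeatedly picking the nearest codeword and quantizing what is left over. It is an illustration under stated assumptions, not the authors' implementation. The function name residual_quantize, the dimensions, the number of levels, and the random codebooks are all hypothetical; in an actual RQ-VAE the codebooks are learned jointly with an encoder and decoder, which are omitted here.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Greedy residual quantization (illustrative, not the paper's code).

    Returns one code index per level and the reconstruction, i.e. the
    sum of the selected codewords.
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Squared Euclidean distance from the current residual to every codeword.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # Add the chosen codeword to the reconstruction and quantize the remainder next.
        recon += codebook[idx]
        residual -= codebook[idx]
    return codes, recon


# Hypothetical setup: random codebooks stand in for learned ones.
rng = np.random.default_rng(0)
dim, levels, codebook_size = 8, 3, 16
codebooks = [
    rng.normal(scale=1.0 / (2 ** level), size=(codebook_size, dim))
    for level in range(levels)
]

# Stand-in for a fused multimodal item representation.
fused_item = rng.normal(size=dim)

codes, recon = residual_quantize(fused_item, codebooks)
print("semantic codes:", codes)
print("reconstruction error:", np.linalg.norm(fused_item - recon))
```

In a pipeline like the one the abstract describes, the resulting code tuple would serve as the item's identifier, and the language model would be trained to generate such tuples token by token. How CEMG maps codes to tokens is not stated in the text above.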
Taxonomy
Topics: Recommender Systems and Techniques · Multimodal Machine Learning Applications · Emotion and Mood Recognition
