CEMG: Collaborative-Enhanced Multimodal Generative Recommendation

Yuzhen Lin, Hongyi Chen, Xuanjing Chen, Shaowen Wang, Ivonne Xu, and Dongming Jiang

arXiv:2512.21543·cs.IR·December 29, 2025

TL;DR

CEMG introduces a novel framework that enhances multimodal recommendation by dynamically integrating visual and textual features with collaborative signals, converting them into semantic codes, and generating recommendations using a fine-tuned language model.

Contribution

This work presents a new multimodal recommendation framework that effectively combines collaborative signals with visual and textual features through a unified, generative approach.

Findings

01 CEMG significantly outperforms existing baselines in recommendation accuracy.

02 The proposed fusion and tokenization methods improve item representation quality.

03 End-to-end training with a language model enhances recommendation diversity and relevance.

Abstract

Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly…
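The tokenization step mentioned in the abstract relies on residual quantization, which can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `residual_quantize`, the toy codebooks, and the input vector are all hypothetical, the codebooks are fixed by hand rather than learned, and the encoder and decoder of the RQ-VAE are omitted. The sketch only shows how a single vector becomes a short sequence of discrete codes, one per quantization level.

```python
import math


def residual_quantize(vector, codebooks):
    """Quantize `vector` level by level (hypothetical helper, not from the paper).

    At each level the nearest codeword to the current residual is selected,
    its index is recorded, and the codeword is subtracted from the residual.
    Returns the list of indices and the sum of the selected codewords.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(residual)
    codes = []
    for codebook in codebooks:
        # Index of the codeword closest (Euclidean distance) to the residual.
        index = min(
            range(len(codebook)),
            key=lambda i: math.dist(codebook[i], residual),
        )
        chosen = codebook[index]
        codes.append(index)
        reconstruction = [r + c for r, c in zip(reconstruction, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return codes, reconstruction


# Toy, hand-written codebooks (assumption: two levels, two codewords each).
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],  # level 1: coarse codewords
    [[0.5, 0.0], [0.0, 0.5]],  # level 2: finer codewords
]

codes, approx = residual_quantize([1.4, 0.1], toy_codebooks)
print(codes, approx)  # [0, 0] [1.5, 0.0]
```

In this toy setting the index list `[0, 0]` plays the role of an item's semantic code: a generative recommender would emit such indices token by token instead of scoring every item in the catalog.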

Taxonomy

Topics: Recommender Systems and Techniques · Multimodal Machine Learning Applications · Emotion and Mood Recognition