Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals

Moritz Vandenhirtz; Kaveh Hassani; Shervin Ghasemlou; Shuai Shao; Hamid Eghbalzadeh; Fuchun Peng; Jun Liu; Michael Louis Iuzzolino

arXiv:2602.03713·cs.IR·February 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces MSCGRec, a multimodal generative recommender that fuses semantic and collaborative signals, employing self-supervised image quantization and constrained sequence learning to outperform existing methods on large datasets.

Contribution

The paper proposes MSCGRec, a novel multimodal generative recommender integrating semantic modalities and collaborative signals with a new constrained sequence learning approach.

Findings

1. Outperforms baseline methods on three large datasets
2. Effective fusion of semantic and collaborative signals
3. Validated through extensive ablation studies

Abstract

Sequential recommender systems rank relevant items by modeling a user's interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. To resolve this, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a…
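As a rough illustration of what "modeling each item as a series of discrete semantic codes" can look like, the sketch below uses residual quantization (the scheme the reviews refer to as RQ). It is a toy under stated assumptions, not the paper's implementation: the codebooks, the item names, and the two-dimensional embeddings are all invented for the example.

```python
# Toy residual quantization: map an item embedding to a short tuple of codes.
# All codebooks and embeddings below are invented illustrative values.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec level by level, passing the residual on to the next level."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next level encodes what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

# Hypothetical two-level codebooks: level 1 is coarse, level 2 refines it.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.25, 0.0], [0.0, 0.25], [0.0, 0.0]],
]

# Hypothetical item embeddings.
ITEMS = {
    "item_a": [1.2, 0.1],
    "item_b": [0.1, 1.3],
    "item_c": [1.0, 0.3],
}

ITEM_CODES = {name: residual_quantize(vec, CODEBOOKS) for name, vec in ITEMS.items()}
```

In this toy, `item_a` and `item_c` share their first code and differ only in the second, which is the coarse-to-fine behavior such code sequences are meant to provide. A generative recommender would then predict the next item by generating its code tuple token by token, rather than scoring every stored item embedding.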

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The paper motivates the use of RQ for collaborative embeddings by arguing that CF signals inherently possess multi-level semantics. While the overall performance gains and ablations confirm the usefulness of incorporating CF as a discrete modality, there is no diagnostic analysis demonstrating that different RQ levels actually capture distinct semantic granularity (e.g., global clusters vs. fine-grained item distinctions). A more direct validation—such as layer-wise probing, codebook visualiz

Weaknesses

1. The paper assumes that CF embeddings exhibit hierarchical semantics suitable for residual quantization, yet provides no empirical diagnostics (e.g., layer-wise probing or semantic clustering) validating that RQ indeed captures different levels of collaborative structure. This weakens the core motivation for applying RQ to CF signals. 2. Converting CF embeddings into discrete tokens may unintentionally encode implicit item identity, especially in high-capacity codebooks. This raises the conce

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This paper advances the emerging generative recommendation paradigm by introducing a principled multimodal framework that integrates collaborative signals alongside semantic modalities, treating them as complementary information sources rather than competing objectives. 2. The paper is well-structured with clear motivation, comprehensive ablation studies, and transparent implementation details. The progression from problem identification to solution design is logical and easy to follow. 3. The

Weaknesses

1. The paper motivates generative recommendation as a solution to the memory and scalability challenges of traditional sequential models. However, MSCGRec fundamentally depends on SASRec embeddings as the collaborative modality, creating a circular dependency. If the generative approach still requires training a full sequential model as a prerequisite, the claimed advantages (reduced memory footprint, avoiding ANN search) become questionable. This undermines the core value proposition of the gen

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. **High Significance and Impact:** The paper addresses a core limitation of the current generative recommendation (Gen-Rec) paradigm: its failure to outperform strong sequential recommenders on large-scale datasets. By presenting a model that (mostly) matches or exceeds strong baselines like SASRec, this work represents a meaningful step toward the practical adoption of scalable generative models. 2. **Novel and Effective Fusion:** The core idea of treating collaborative features (i.e., item e

Weaknesses

1. **Lack of Statistical Rigor:** The paper reports no confidence intervals or standard deviations over multiple runs. This is a critical omission, as many of the performance improvements are marginal (e.g., on the Beauty dataset, MSCGRec's R@10 of 0.0315 is *lower* than SASRec's 0.0317). Without statistical tests, the claim of "surpassing" sequential models is not adequately supported. 2. **Critical Missing Analysis on Collisions:** The model uses a very small codebook (3 levels, 256 entries e
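The collision concern raised in this review can be made concrete with a short sketch. It assumes the truncated text above means 256 entries per level across 3 levels; that reading, the helper `collision_rate`, and the toy code assignments are assumptions for illustration only and do not come from the paper.

```python
from collections import Counter

def collision_rate(item_codes):
    """Fraction of items whose code tuple is shared with at least one other item."""
    counts = Counter(item_codes.values())
    collided = sum(n for n in counts.values() if n > 1)
    return collided / len(item_codes)

# Assumed configuration: 3 levels with 256 entries per level.
LEVELS, ENTRIES = 3, 256
CAPACITY = ENTRIES ** LEVELS  # number of distinct code tuples available

# Hypothetical code assignments; items "a" and "b" collide.
toy_codes = {
    "a": (1, 2, 3),
    "b": (1, 2, 3),
    "c": (1, 2, 4),
    "d": (7, 0, 9),
}
rate = collision_rate(toy_codes)
```

Under that assumption the code space holds 256^3 = 16,777,216 distinct tuples, yet collisions can still occur well below that count because learned codes are rarely spread uniformly. Reporting a measured rate like the one above would address the reviewer's point directly.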

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis