Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals
Moritz Vandenhirtz, Kaveh Hassani, Shervin Ghasemlou, Shuai Shao, Hamid Eghbalzadeh, Fuchun Peng, Jun Liu, Michael Louis Iuzzolino

TL;DR
This paper introduces MSCGRec, a multimodal generative recommender that fuses semantic and collaborative signals, employing self-supervised image quantization and constrained sequence learning to outperform existing methods on large datasets.
Contribution
The paper proposes MSCGRec, a novel multimodal generative recommender integrating semantic modalities and collaborative signals with a new constrained sequence learning approach.
Findings
Outperforms baseline methods on three large datasets
Effective fusion of semantic and collaborative signals
Validated through extensive ablation studies
Abstract
Sequential recommender systems rank relevant items by modeling a user's interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. To resolve this, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a…
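The idea of representing each item as a series of discrete codes can be illustrated with residual quantization, the technique the reviews below refer to as RQ. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy sizes (3 levels, 8 entries, 4 dimensions), and the randomly initialized codebooks are all hypothetical. In practice the codebooks would be learned from item embeddings.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Map an embedding to one code per level by quantizing successive residuals."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Whatever this level failed to explain is passed on to the next level.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def reconstruct(codes, codebooks):
    """Approximate the original embedding by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (assumption): random codebooks stand in for learned ones.
rng = random.Random(0)
DIM, LEVELS, ENTRIES = 4, 3, 8
codebooks = [
    [[rng.gauss(0.0, 1.0 / (level + 1)) for _ in range(DIM)] for _ in range(ENTRIES)]
    for level in range(LEVELS)
]
item_embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

codes = residual_quantize(item_embedding, codebooks)
print("item codes:", codes)  # one integer per level
```

Under this scheme a recommender stores only the shared codebooks and a short code tuple per item, and an autoregressive model predicts the next item by emitting that tuple one code at a time.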
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper motivates the use of RQ for collaborative embeddings by arguing that CF signals inherently possess multi-level semantics. While the overall performance gains and ablations confirm the usefulness of incorporating CF as a discrete modality, there is no diagnostic analysis demonstrating that different RQ levels actually capture distinct semantic granularity (e.g., global clusters vs. fine-grained item distinctions). A more direct validation—such as layer-wise probing, codebook visualiz
1. The paper assumes that CF embeddings exhibit hierarchical semantics suitable for residual quantization, yet provides no empirical diagnostics (e.g., layer-wise probing or semantic clustering) validating that RQ indeed captures different levels of collaborative structure. This weakens the core motivation for applying RQ to CF signals. 2. Converting CF embeddings into discrete tokens may unintentionally encode implicit item identity, especially in high-capacity codebooks. This raises the conce
1. This paper advances the emerging generative recommendation paradigm by introducing a principled multimodal framework that integrates collaborative signals alongside semantic modalities, treating them as complementary information sources rather than competing objectives. 2. The paper is well-structured with clear motivation, comprehensive ablation studies, and transparent implementation details. The progression from problem identification to solution design is logical and easy to follow. 3. The
1. The paper motivates generative recommendation as a solution to the memory and scalability challenges of traditional sequential models. However, MSCGRec fundamentally depends on SASRec embeddings as the collaborative modality, creating a circular dependency. If the generative approach still requires training a full sequential model as a prerequisite, the claimed advantages (reduced memory footprint, avoiding ANN search) become questionable. This undermines the core value proposition of the gen
1. **High Significance and Impact:** The paper addresses a core limitation of the current generative recommendation (Gen-Rec) paradigm: its failure to outperform strong sequential recommenders on large-scale datasets. By presenting a model that (mostly) matches or exceeds strong baselines like SASRec, this work represents a meaningful step toward the practical adoption of scalable generative models. 2. **Novel and Effective Fusion:** The core idea of treating collaborative features (i.e., item e
1. **Lack of Statistical Rigor:** The paper reports no confidence intervals or standard deviations over multiple runs. This is a critical omission, as many of the performance improvements are marginal (e.g., on the Beauty dataset, MSCGRec's R@10 of 0.0315 is *lower* than SASRec's 0.0317). Without statistical tests, the claim of "surpassing" sequential models is not adequately supported. 2. **Critical Missing Analysis on Collisions:** The model uses a very small codebook (3 levels, 256 entries e
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
