MMGRec: Multimodal Generative Recommendation with Transformer Model

Han Liu; Yinwei Wei; Xuemeng Song; Weili Guan; Yuan-Fang Li; Liqiang Nie

arXiv:2404.16555 · cs.IR · January 15, 2026 · 1 cites

Open Access

TL;DR

This paper introduces MMGRec, a novel multimodal recommendation model that employs a generative approach with a Transformer and hierarchical quantization to improve recommendation accuracy and efficiency.

Contribution

The paper proposes a generative paradigm for multimodal recommendation using a Transformer and a new hierarchical quantization method, addressing limitations of previous embed-and-retrieve models.

Findings

01 Outperforms state-of-the-art methods in recommendation accuracy

02 Reduces inference cost compared to traditional embed-and-retrieve models

03 Effectively models non-sequential interaction data with relation-aware self-attention

Abstract

Multimodal recommendation aims to recommend user-preferred candidates based on the user's historically interacted items and associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: learning user and item representations in the same embedding space, then retrieving similar candidate items for a user via embedding inner product. However, this paradigm suffers from inference cost, interaction modeling, and false-negative issues. To this end, we propose a new MMGRec model to introduce a generative paradigm into multimodal recommendation. Specifically, we first devise a hierarchical quantization method, Graph RQ-VAE, to assign a Rec-ID to each item from its multimodal and CF information. Consisting of a tuple of semantically meaningful tokens, the Rec-ID serves as the unique identifier of each item. Afterward, we train a Transformer-based recommender to…
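To make the idea of an identifier built from a tuple of tokens more concrete, the following is a minimal sketch of plain residual quantization, the general technique that RQ-VAE-style methods build on. It is not the paper's Graph RQ-VAE: the codebooks here are fixed, hand-picked toy values rather than learned ones, there is no graph or encoder component, and the names `nearest_code`, `residual_quantize`, and `CODEBOOKS` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec: Vector,
                      codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize vec level by level, returning one token index per level.

    At each level the nearest code is picked for the current residual,
    and that code is subtracted before moving to the next, finer level.
    """
    residual: List[float] = list(vec)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens)


# Assumption: toy 2-D codebooks with two levels (coarse, then fine).
# A real system would learn these from item features.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # level 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # level 2: fine
]

item_a = [1.2, 1.0]  # hypothetical item embedding
item_b = [1.0, 1.3]  # a nearby item embedding

rec_id_a = residual_quantize(item_a, CODEBOOKS)
rec_id_b = residual_quantize(item_b, CODEBOOKS)
print(rec_id_a, rec_id_b)  # (1, 1) (1, 2)
```

In this toy setup the two nearby items share the first (coarse) token and differ only in the second (fine) token, which illustrates how a hierarchical token tuple can group similar items under a common prefix while still telling them apart. A generative recommender would then predict such tuples token by token instead of scoring every item embedding.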

Taxonomy

Topics: Topic Modeling · Recommender Systems and Techniques · Speech and dialogue systems

Methods: Attention Is All You Need · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings