Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation
Haofeng Huang, Ling Gai

TL;DR
Q-Bert4Rec introduces a multimodal sequential recommendation framework that enhances item representations with rich semantic information through cross-modal fusion and quantization, leading to improved recommendation accuracy.
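The summary does not say how the fused embeddings are quantized. As an illustration only, the sketch below assumes residual quantization, a common way to turn a continuous item embedding into a short tuple of discrete codes (a "semantic ID"). At each level it picks the nearest codeword in that level's codebook, then passes the leftover residual to the next level. The codebook sizes, the dimensions, and the function name `residual_quantize` are hypothetical. The codebooks here are random rather than learned.

```python
import random
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(vec: Sequence[float], codebooks: List[List[Vector]]) -> List[int]:
    """Map a continuous item embedding to a tuple of discrete code indices.

    At each level: choose the codeword nearest to the current residual,
    record its index, and subtract it before moving to the next level.
    """
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


if __name__ == "__main__":
    random.seed(0)
    dim, levels, size = 4, 3, 8  # hypothetical: 3 levels of 8 codewords each
    codebooks = [
        [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
        for _ in range(levels)
    ]
    item_embedding = [0.5, -1.2, 0.3, 0.9]
    print(residual_quantize(item_embedding, codebooks))  # a 3-token semantic ID
```

With learned codebooks, similar embeddings tend to share their leading codes. Items with related content therefore get overlapping semantic IDs, which is what makes such tokens easier to interpret than arbitrary item IDs.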
Contribution
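It proposes a novel framework that combines semantic injection, quantization, and multi-mask pretraining for multimodal sequential recommendation. The framework addresses a key limitation of existing ID-based methods: their discrete item IDs lack semantic meaning and ignore multimodal content such as text and images.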
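The summary names multi-mask pretraining but does not define the masking strategies. The sketch below assumes a BERT4Rec-style Cloze objective over a sequence of item (or semantic-ID) tokens. It produces two masked views of one sequence: one with randomly chosen positions masked, and one with only the final position masked, which mirrors next-item prediction. The mask value `MASK = -1`, the masking probability, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

MASK = -1  # hypothetical id reserved for the [MASK] token

MaskedView = Tuple[List[int], List[int]]  # (masked tokens, target positions)


def random_mask(seq: List[int], prob: float, rng: random.Random) -> MaskedView:
    """Cloze-style masking: each position is masked independently with `prob`."""
    masked, targets = list(seq), []
    for i in range(len(seq)):
        if rng.random() < prob:
            masked[i] = MASK
            targets.append(i)
    if not targets:  # guarantee at least one prediction target per view
        i = rng.randrange(len(seq))
        masked[i] = MASK
        targets.append(i)
    return masked, targets


def last_item_mask(seq: List[int]) -> MaskedView:
    """Mask only the final position, matching the next-item prediction setting."""
    masked = list(seq)
    masked[-1] = MASK
    return masked, [len(seq) - 1]


def make_pretraining_views(seq: List[int], prob: float = 0.2, seed: int = 0) -> List[MaskedView]:
    """Return several masked views of one interaction sequence (hypothetical recipe)."""
    rng = random.Random(seed)
    return [random_mask(seq, prob, rng), last_item_mask(seq)]


if __name__ == "__main__":
    for tokens, targets in make_pretraining_views([12, 7, 33, 5, 19]):
        print(tokens, targets)
```

A model trained on these views would learn to recover the original tokens at the target positions of every view. The input sequence itself is never modified.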
Findings
Q-Bert4Rec outperforms existing methods on Amazon benchmarks.
Semantic tokenization improves model interpretability.
Multimodal fusion enhances recommendation accuracy.
Abstract
Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic…
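Stage (1) is described only at a high level, as "a dynamic transformer that fuses textual, visual, and structural features." The sketch below is a deliberately minimal stand-in, not the paper's architecture. It performs a single scaled dot-product attention step:

- The randomly initialized ID embedding is the query.
- The three modality feature vectors are the keys and values. They are assumed to be already projected to the same dimension.
- The attended mixture is added back to the ID embedding residually.

The function name `inject_semantics` and the toy vectors are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_semantics(id_emb: Sequence[float], modality_feats: List[Sequence[float]]) -> Vector:
    """Enrich an ID embedding with modality features via one attention step.

    Query: the ID embedding. Keys/values: the modality feature vectors.
    The weighted mixture of features is added back residually.
    """
    d = len(id_emb)
    scores = [sum(q * k for q, k in zip(id_emb, f)) / math.sqrt(d) for f in modality_feats]
    weights = softmax(scores)
    fused = [sum(w * f[j] for w, f in zip(weights, modality_feats)) for j in range(d)]
    return [e + v for e, v in zip(id_emb, fused)]


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    id_emb = [random.gauss(0.0, 0.02) for _ in range(dim)]  # random ID initialization
    text, image, structure = ([random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
    print(inject_semantics(id_emb, [text, image, structure]))
```

The attention weights depend on the ID embedding itself, so each item can weight its modalities differently. This is one plausible reading of "dynamic" fusion. A real transformer would add learned projections, multiple heads, and stacked layers.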
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
