MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation

Yi Xu; Moyu Zhang; Chenxuan Li; Zhihao Liao; Haibo Xing; Hao Deng; Jinxin Hu; Yu Zhang; Xiaoyi Zeng; Jing Zhang

arXiv:2508.15281 · cs.IR · March 3, 2026

TL;DR

The paper introduces MMQ, a multimodal tokenizer that generates semantic IDs for items, improving recommendation accuracy and scalability by effectively combining content modalities and adapting to user behavior.

Contribution

It proposes a novel two-stage framework with a shared-specific tokenizer and behavior-aware fine-tuning to enhance semantic ID generation for recommender systems.

Findings

01. Outperforms existing methods in offline experiments

02. Improves recommendation quality in online A/B tests

03. Effectively balances multimodal synergy and specificity

Abstract

Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert…
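
To make the idea of a semantic ID concrete, here is a minimal sketch of how a content embedding might be turned into a short tuple of discrete codes. It assumes plain residual quantization over small hand-written codebooks, a generic technique chosen for illustration; it is not MMQ's shared-specific tokenizer, whose details are cut off in the abstract above. The names `nearest_code`, `semantic_id`, and `CODEBOOKS`, and all numeric values, are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def semantic_id(embedding: Vector,
                codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize an embedding level by level, passing the residual down."""
    residual: List[float] = list(embedding)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        codes.append(idx)
        # Subtract the chosen code so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Toy codebooks (assumed values; in practice these would be learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],                 # level 1: coarse clusters
    [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]],   # level 2: fine corrections
]

item_a = semantic_id([1.1, 0.9], CODEBOOKS)   # (1, 1)
item_b = semantic_id([0.9, 1.1], CODEBOOKS)   # (1, 2)
item_c = semantic_id([0.1, -0.1], CODEBOOKS)  # (0, 1)
```

In this toy setup, `item_a` and `item_b` have nearby embeddings and share their first code, while `item_c` falls in a different coarse cluster. Shared prefixes of this kind are what let a recommender transfer knowledge between similar items, including new or rare ones.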
