Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation
Yangchen Zeng, Jinze Wang

TL;DR
This paper introduces a novel framework for Semantic ID generation in generative recommendation, addressing semantic degradation, modality misalignment, and information loss through cross-modal alignment, interest mining, and reinforcement learning.
Contribution
It proposes an integrated approach combining deep interest mining, cross-modal semantic alignment, and quality-aware reinforcement to improve SID quality and modality consistency.
Findings
Outperforms state-of-the-art SID generation methods on multiple benchmarks.
Effectively aligns text and image modalities using vision-language models.
Enhances semantic preservation and reduces information degradation in SID generation.
Abstract
Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, relying on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline loses semantic information, and there is no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from the original multimodal features, because the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a…
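To make the "cascaded quantization" mentioned in the abstract concrete, the sketch below shows a generic residual quantizer that turns an item embedding into a short code sequence (a SID) and reports how much of the embedding is left unencoded. It is an illustration of the general idea only, not the paper's method: the function names, the random codebooks, the codebook sizes, and the example embedding are all assumptions made for this example.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Cascaded quantization: each level encodes what the previous levels left over."""
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        # Subtract the chosen codeword; the remainder goes to the next level.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    # Squared norm of the final residual: the part of the embedding the SID does not capture.
    loss = sum(r * r for r in residual)
    return sid, loss

# Hypothetical setup (assumed for illustration): 3 levels, 8 codewords each, 4-dim embedding.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1.0 / (level + 1)) for _ in range(4)] for _ in range(8)]
    for level in range(3)
]
item_embedding = [0.9, -0.3, 0.5, 0.1]
sid, loss = residual_quantize(item_embedding, codebooks)
print("SID:", sid, "unencoded squared error:", loss)
```

In this toy setting the codebooks are fixed and the embedding comes from elsewhere, so nothing ties the two stages to a shared objective. That separation is the kind of two-stage pipeline the abstract criticizes.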
