When Text-as-Vision Meets Semantic IDs in Generative Recommendation: An Empirical Study

Shutong Qiao; Wei Yuan; Tong Chen; Xiangyu Zhao; Quoc Viet Hung Nguyen; Hongzhi Yin

arXiv:2601.14697·cs.IR·January 22, 2026

Open Access

TL;DR

This paper explores using OCR-based visual representations of item descriptions for Semantic ID learning in generative recommendation, demonstrating improved robustness and effectiveness over traditional text encoders, especially in multimodal settings.

Contribution

The study systematically evaluates OCR-based visual text representations for Semantic ID learning, showing that they match or outperform standard text encoders across recommendation scenarios, including multimodal settings.

Findings

01. OCR-text matches or exceeds standard text embeddings in performance.

02. OCR-based Semantic IDs are robust under spatial-resolution compression.

03. OCR representations improve cross-modal fusion stability.

Abstract

Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, these text encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. These text encoders can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating…
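
The fragmentation issue described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or any real encoder's tokenizer: it is a minimal greedy longest-match subword segmenter (WordPiece-style), and both the function names and the tiny vocabulary `TOY_VOCAB` are hypothetical, chosen only to show how attribute strings such as "500ml" or "16GB" can be split into many small pieces while ordinary words stay whole.

```python
# Toy illustration only: a greedy longest-match subword segmenter.
# TOY_VOCAB is a hypothetical vocabulary, not taken from any real model.
TOY_VOCAB = {
    "wireless", "mouse", "usb",
    "1", "5",
    "##0", "##6", "##m", "##l", "##g", "##b", "##c", "##-",
}


def greedy_subword_tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes an unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and segment each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(greedy_subword_tokenize(word, vocab))
    return tokens


if __name__ == "__main__":
    print(tokenize("wireless mouse"))  # natural-language words stay intact
    print(tokenize("500ml USB-C 16GB"))  # attribute strings fragment
```

Under this toy vocabulary, "wireless mouse" yields two tokens, whereas "500ml" alone yields five pieces ("5", "##0", "##0", "##m", "##l"), so the quantity and its unit no longer appear as a single coherent signal. Real tokenizers have far larger vocabularies and will split differently; the example only conveys the general effect.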

Taxonomy

Topics: Recommender Systems and Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis