Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen; Zhoutong Fu; Chengming Jiang; Haichao Zhang; Ran Zhou; Tan Wang; Chunnan Yao; Guoyao Li; Rui Cai; Yihan Cao; Ruijie Jiang; Fedor Borisyuk; Jianqiang Shen; Jingwei Wu; Ramya Korlakai Vinayak

arXiv:2604.02324·cs.CL·April 3, 2026

TL;DR

This paper identifies the limitations of mean initialization for new vocabulary tokens in language models and proposes a grounded initialization method that improves token representation and model performance in generative recommendation tasks.

Contribution

The paper introduces Grounded Token Initialization (GTI), a method that semantically grounds new tokens in the pretrained embedding space before fine-tuning.

Findings

01. GTI outperforms mean initialization and existing methods on multiple benchmarks.

02. Grounded embeddings maintain richer inter-token structure after fine-tuning.

03. Token initialization quality significantly impacts the effectiveness of vocabulary extension.

Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before…
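The collapse described in the abstract can be illustrated with a minimal sketch of the mean-initialization baseline. It assumes NumPy and uses a random matrix as a stand-in for a pretrained embedding table. The function names `mean_initialize` and `numerical_rank`, the matrix sizes, and the rank check are all hypothetical choices made for illustration; they are not the paper's diagnostics, and the sketch does not implement GTI.

```python
import numpy as np


def mean_initialize(existing: np.ndarray, num_new: int) -> np.ndarray:
    """Return num_new rows, each equal to the mean of the existing embeddings."""
    mean_vec = existing.mean(axis=0, keepdims=True)  # shape (1, dim)
    return np.repeat(mean_vec, num_new, axis=0)      # shape (num_new, dim)


def numerical_rank(matrix: np.ndarray, tol: float = 1e-8) -> int:
    """Count singular values above tol relative to the largest one."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int((singular_values > tol * singular_values.max()).sum())


# Assumption: a random Gaussian matrix stands in for pretrained embeddings.
rng = np.random.default_rng(0)
existing = rng.normal(size=(1000, 64))  # 1000 existing tokens, dimension 64

new_tokens = mean_initialize(existing, num_new=16)

# Every new row is the same vector, so the new block spans one direction only.
rank_new = numerical_rank(new_tokens)
rank_existing = numerical_rank(existing)
print("rank of new-token block:", rank_new)
print("rank of existing embeddings:", rank_existing)
```

Because every new row is an identical copy of the mean vector, the new-token block has rank one and carries no distinctions between tokens at initialization. Fine-tuning then has to create all of that structure from scratch, which is the bottleneck the abstract points to.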
