Token Distillation: Attention-aware Input Embeddings For New Tokens

Konstantin Dobler; Desmond Elliott; Gerard de Melo

arXiv:2505.20133·cs.CL·March 16, 2026

TL;DR

This paper introduces Token Distillation, a method to efficiently initialize embeddings for new tokens in language models by distilling representations from original tokenizations, improving performance without extensive retraining.

Contribution

The paper proposes Token Distillation, a technique that quickly learns input embeddings for new tokens by distilling representations obtained with the original tokenization, and that outperforms even strong baselines.

Findings

1. Token Distillation outperforms even strong baselines.

2. It quickly learns high-quality input embeddings for new tokens.

3. It is applicable across a wide range of open-weight models.

Abstract

Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
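The distillation idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: `last_hidden_state` is a hypothetical single attention step standing in for a frozen language model, the new embedding starts from the mean of its subtoken embeddings, gradients are estimated by finite differences, and hidden states are matched only at the final position with an MSE loss (the objective one of the reviews below attributes to the method). The layers, positions, and optimizer used in the paper are not specified on this page.

```python
import math
import random

DIM = 4  # toy embedding size (hypothetical)


def last_hidden_state(seq):
    """Frozen toy 'model': dot-product attention with the last vector as query, then tanh."""
    query = seq[-1]
    scores = [sum(q * k for q, k in zip(query, key)) for key in seq]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    mixed = [sum(w * vec[d] for w, vec in zip(weights, seq)) / total for d in range(DIM)]
    return [math.tanh(x) for x in mixed]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_loss(new_emb, contexts, teacher_states):
    """Mean MSE between student states (one new token) and cached teacher states."""
    total = 0.0
    for (prefix, suffix), target in zip(contexts, teacher_states):
        student = last_hidden_state(prefix + [new_emb] + suffix)
        total += mse(student, target)
    return total / len(contexts)


def numeric_grad(new_emb, contexts, teacher_states, eps=1e-5):
    """Central finite-difference gradient with respect to the new embedding only."""
    grad = []
    for d in range(DIM):
        up, down = list(new_emb), list(new_emb)
        up[d] += eps
        down[d] -= eps
        diff = distill_loss(up, contexts, teacher_states) - distill_loss(down, contexts, teacher_states)
        grad.append(diff / (2 * eps))
    return grad


def distill(subtoken_embs, contexts, steps=200, lr=0.5):
    # Teacher: hidden states obtained with the original (multi-token) tokenization.
    teacher_states = [last_hidden_state(prefix + subtoken_embs + suffix)
                      for prefix, suffix in contexts]
    # Student starts from the mean of the subtoken embeddings (an assumed starting point).
    new_emb = [sum(e[d] for e in subtoken_embs) / len(subtoken_embs) for d in range(DIM)]
    initial = loss = distill_loss(new_emb, contexts, teacher_states)
    for _ in range(steps):
        grad = numeric_grad(new_emb, contexts, teacher_states)
        candidate = [x - lr * g for x, g in zip(new_emb, grad)]
        candidate_loss = distill_loss(candidate, contexts, teacher_states)
        if candidate_loss < loss:  # accept only steps that reduce the loss
            new_emb, loss = candidate, candidate_loss
        else:
            lr *= 0.5  # otherwise shrink the step size and retry
    return new_emb, initial, loss


rng = random.Random(0)


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(DIM)]


subtoken_embs = [rand_vec(), rand_vec()]  # the two subtokens the new token replaces
contexts = [([rand_vec() for _ in range(3)], [rand_vec()]) for _ in range(8)]
new_emb, initial_loss, final_loss = distill(subtoken_embs, contexts)
print(f"loss with mean init: {initial_loss:.6f}, after distillation: {final_loss:.6f}")
```

Only the new token's embedding is updated; the toy model itself stays frozen, which mirrors the abstract's claim that the embeddings can be learned without extensive retraining. A real setup would replace the finite-difference loop with automatic differentiation over an actual model's hidden states.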

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well-motivated. The paper clearly identifies the problem with conventional methods for embedding initialization of new tokens (i.e., not accounting for higher-layer model dynamics) and the necessity of training an auxiliary model for sophisticated methods like ZeTT. The proposed method clearly addresses these issues and employs a lightweight training-based approach that aims to make the representations of a new token and its corresponding source tokens similar using an MSE loss.

Weaknesses

1. The major limitation of this paper is a lack of full continual pre-training results. While the proposed method can provide a better starting point as “initialization”, it does not always guarantee a better performance after continual pre-training on target data. Any gains seen at the starting point might not hold up when we conduct further tuning on the target data. Given that almost all methods exhibit worse performance after embedding adaptation, they inevitably require continual pre-traini

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Novel and Effective: The paper proposes a clever and elegant method, "Token Distillation," which distills the model's internal behavior (hidden states) rather than simply aggregating embedding vectors. Experiments robustly show it outperforms strong baselines.
2. Thorough Experimental Validation: The claims are supported by rigorous experiments across a diverse set of models, tasks (domain and language adaptation), and a comprehensive suite of baselines, demonstrating the method's reliability

Weaknesses

1. Insufficient Justification for Practical Significance: The paper's primary weakness is the lack of compelling evidence for why the proposed efficiency-performance trade-off is critical. In many high-stakes domains, even a small performance drop is unacceptable, and the paper fails to demonstrate scenarios where the efficiency gain from token compression is a mission-critical requirement rather than a minor convenience.
2. Misaligned Motivation: There is a narrative disconnect. The motivation

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- It is interesting to incorporate the Transformer layer into the initialization of new input embedding parameters.
- The experimental results are promising for Transformer decoder models in the vocabulary adaptation task.

Weaknesses

- Experiments are conducted only on Transformer decoder models. The method's performance on vocabulary adaptation for Transformer encoder models is unclear.
- The number of tokens added in the experiments is not reported. The performance of this method may be affected by the lexical similarity between the target token and source token. For example, if all target tokens like Arabic or Chinese words are significantly different from the source tokens like English words and original subtokens $
