Zero-Shot Tokenizer Transfer
Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

TL;DR
This paper introduces Zero-Shot Tokenizer Transfer (ZeTT), the problem of swapping a language model's tokenizer for an arbitrary new one without degrading performance, and addresses it by training a hypernetwork that predicts token embeddings for the new tokenizer, enabling greater flexibility across languages and tasks.
Contribution
We propose a hypernetwork-based approach for zero-shot tokenizer transfer that generalizes to new tokenizers and reduces sequence length, improving model flexibility and efficiency.
Findings
The hypernetwork predicts effective embeddings for new tokenizers.
Performance stays close to that of the original models on multilingual and coding tasks.
Remaining gaps can be closed with less than 1B tokens of additional training.
Abstract
Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We…
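To make the core idea concrete, here is a minimal toy sketch of "a function that takes a new vocabulary and returns one embedding per token". It is not the authors' architecture, and nothing in it is trained. As assumptions for illustration only, each new token is split into pieces of the original vocabulary by greedy longest match, the original embeddings of those pieces are mean-pooled, and a linear map `W` (left at the identity here) stands in for the learned hypernetwork weights. The names `greedy_pieces` and `ToyHypernetwork` are hypothetical.

```python
def greedy_pieces(token, orig_vocab):
    """Split `token` into pieces found in `orig_vocab` by greedy longest match."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in orig_vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            # No piece of the original vocabulary starts at position i.
            raise ValueError(f"cannot cover {token[i]!r} with the original vocabulary")
    return pieces


class ToyHypernetwork:
    """Hypothetical stand-in: maps a new vocabulary to predicted embeddings."""

    def __init__(self, orig_embeddings):
        self.orig = orig_embeddings
        self.dim = len(next(iter(orig_embeddings.values())))
        # Assumption: an identity matrix stands in for trained weights.
        self.W = [[1.0 if r == c else 0.0 for c in range(self.dim)]
                  for r in range(self.dim)]

    def embed_token(self, token):
        pieces = greedy_pieces(token, self.orig)
        # Mean-pool the original embeddings of the pieces.
        pooled = [sum(self.orig[p][k] for p in pieces) / len(pieces)
                  for k in range(self.dim)]
        # Apply the (untrained) linear map.
        return [sum(self.W[r][c] * pooled[c] for c in range(self.dim))
                for r in range(self.dim)]

    def __call__(self, new_vocab):
        return {tok: self.embed_token(tok) for tok in new_vocab}


# Toy original embedding table (made-up values) and a new vocabulary.
orig_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "ab": [1.0, 1.0]}
hypernet = ToyHypernetwork(orig_embeddings)
new_embeddings = hypernet(["ab", "ba", "abab"])
```

With the identity map, this reduces to a simple pooling heuristic. A trained hypernetwork would replace the pooling and `W` with learned components.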
Taxonomy
Topics: Organoboron and organosilicon chemistry
Methods: Balanced Selection · HyperNetwork
