One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

TL;DR
This paper introduces a universal tokenizer trained on many languages to improve multilingual model adaptability, enabling better language coverage and adaptation with minimal performance trade-offs.
Contribution
The study demonstrates that a universal tokenizer enhances language adaptation and plasticity in multilingual models compared to language-specific tokenizers.
Findings
Up to 20.2% increase in win rates for language adaptation.
Improved plasticity for unseen languages by up to 5%.
Minimal performance compromise on pretraining languages.
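The win-rate figures above come from pairwise comparisons scored by a judge. As a rough illustration of how such a number can be computed, here is a minimal sketch. The verdict lists are invented toy data, the `win_rate` helper is hypothetical, and counting ties as non-wins is an assumption; the paper's exact scoring protocol is not described in this summary.

```python
from collections import Counter


def win_rate(verdicts):
    """Fraction of comparisons judged a win (assumption: ties count as non-wins)."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return counts["win"] / total


# Invented judge verdicts for two hypothetical models on the same 10 prompts,
# each compared against the same reference system.
baseline_verdicts = ["win", "loss", "loss", "tie", "win",
                     "loss", "loss", "win", "loss", "loss"]
universal_verdicts = ["win", "win", "loss", "tie", "win",
                      "win", "loss", "win", "loss", "win"]

baseline_rate = win_rate(baseline_verdicts)
universal_rate = win_rate(universal_verdicts)

# Difference expressed in percentage points (not a relative change).
gain_points = (universal_rate - baseline_rate) * 100
print(f"baseline={baseline_rate:.2f} universal={universal_rate:.2f} "
      f"gain={gain_points:.1f} points")
```

Whether a reported gain is in percentage points or a relative change matters when comparing results, so it is worth checking which convention a given table uses.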
Abstract
Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher…
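The intuition behind tokenizer coverage can be sketched with a toy byte-pair-encoding (BPE) trainer. This is not the paper's tokenizer, data, or training setup: the corpora are invented sentences, the merge budget of 30 is arbitrary, and the helper names (`train_bpe`, `apply_merge`, `tokens_per_word`) are hypothetical. It only shows that a tokenizer trained without a language tends to split that language's text into many more tokens per word.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-split words."""
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def tokens_per_word(text, merges):
    """Average number of tokens per word after applying the merges in order."""
    words = text.split()
    total = 0
    for word in words:
        symbols = tuple(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        total += len(symbols)
    return total / len(words)


# Toy data (invented): a "primary" language and an "expansion" language.
primary_corpus = ["the cat sat on the mat", "the dog sat on the log"]
expansion_corpus = ["kedi evde uyuyor", "köpek evde oynuyor"]

# Same merge budget for both tokenizers; only the training data differs.
primary_merges = train_bpe(primary_corpus, num_merges=30)
universal_merges = train_bpe(primary_corpus + expansion_corpus, num_merges=30)

expansion_text = "kedi evde oynuyor"
primary_fertility = tokens_per_word(expansion_text, primary_merges)
universal_fertility = tokens_per_word(expansion_text, universal_merges)
print(primary_fertility, universal_fertility)
```

In this toy setting the primary-only tokenizer falls back to single characters on the expansion-language text, while the tokenizer that saw both languages produces fewer tokens per word. Real tokenizers differ in many ways (byte fallback, pre-tokenization, data weighting), so this is only a qualitative illustration.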
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper presents a clear and thoughtful study of how tokenizer design influences multilingual adaptability.
- Its originality lies in approaching language plasticity (Chen et al., 2023) through a simple, pretraining-time intervention rather than architectural or post-hoc changes.
- The experiments are broad, well controlled, and convincingly demonstrate that a universal tokenizer improves cross-lingual transfer without harming high-resource performance.
- The experiments, though broad, are confined to mid-sized models; it remains unclear whether the observed gains in adaptation speed and multilingual coverage hold at larger scales.
- The evaluation relies heavily on LLM-as-judge metrics. Adding human or cross-judge validation, especially for languages with distinct scripts, would bolster confidence in the reported improvements.
- The paper’s discussion of prior work on language plasticity is limited. It briefly cites Chen et al. (2023) to d
1. Addresses an important challenge in LLMs regarding the utility of tokenizers for diverse languages.
2. Experiments are extensive, and those sections are well written and clearly explained.
3. An improved multilingual tokenizer would be of broad interest to the multilingual LLM community.
1. The expanded (seen languages) setup, which is the main focus of the paper, feels somewhat artificial. Because the Universal tokenizer sees data from all languages, the comparison seems unfair: if pretraining data for these languages is already available to build the tokenizers, one could simply create a single tokenizer using an upweighting strategy and then pretrain on all available data. That would likely be the practical approach. It is difficult to imagine a real-world scenar
- The experiments are very extensive and well-controlled (69 languages, 3 language clusters + 7 unseen languages, multiple adaptation regimes).
- There is clear evidence of improvement in plasticity and adaptation efficiency without trade-offs on primary languages.
- The authors provide clear evidence that a universal tokenizer is a simple, yet scalable and low-cost method to enhance multilingual coverage.
- The use of LLM-as-a-Judge for open-ended generation is reasonable but subjective. Given that 15 adaptation languages are evaluated, including even a small-scale human verification subset would strengthen the credibility of the win-rate results.
- Experiments are conducted on a 3.3B model. Since multilingual capability often scales with model size and inference budget [1], it remains unclear whether the Universal Tokenizer's benefits persist, or diminish, at larger scales. I understand the di
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Focus · Sparse Evolutionary Training
