TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel

TL;DR
TokSuite is a comprehensive benchmark and set of models designed to isolate and measure the impact of different tokenizers on language model performance and behavior, revealing their respective strengths and weaknesses.
Contribution
This work introduces TokSuite, a novel benchmark and a collection of models trained with various tokenizers, enabling detailed analysis of tokenizer effects on language models.
Findings
Different tokenizers significantly influence model performance.
Some tokenizers better handle real-world text perturbations.
Tokenization choices impact model robustness and efficiency.
Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel…
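To make "perturbations that are likely to influence tokenization" concrete, here is a minimal sketch of how a small change to the input can change its segmentation. It assumes a toy greedy longest-match tokenizer; the vocabulary `TOY_VOCAB` and the function `greedy_tokenize` are hypothetical and are not taken from TokSuite or from any of the fourteen tokenizers studied.

```python
import unicodedata

# Hypothetical toy vocabulary for illustration only; not a TokSuite tokenizer.
TOY_VOCAB = {"the", " ", "token", "izer", "s", "cafe", "caf\u00e9"}


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation with single-character fallback."""
    max_len = max(len(v) for v in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


clean = "the tokenizers"
typo = "the tokneizers"  # two adjacent letters swapped
nfc = unicodedata.normalize("NFC", "caf\u00e9")  # precomposed e-acute
nfd = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' + combining accent

for label, s in [("clean", clean), ("typo", typo), ("NFC", nfc), ("NFD", nfd)]:
    toks = greedy_tokenize(s)
    print(f"{label:5s} {len(toks)} tokens: {toks}")
```

Under this toy vocabulary, the typo fragments a word that was previously covered by two tokens into single characters, and the NFD form of a string that renders identically to its NFC form gets a different segmentation. Real tokenizers differ in how they handle such cases, which is the kind of variation the benchmark is described as measuring.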
Peer Reviews
Decision · Submitted to ICLR 2026
The study is systematic and broad.
Training new models has the problem that the models are comparatively small, so one has to wonder whether the findings are transferable. I would have loved to see the Byte Latent Transformer in the study, as well as decoding methods that target byte-level decoding, such as Phan 2025.
* This paper conducts an in-depth study of tokenizers in large language models. By keeping all other conditions constant and varying only the tokenizer, the authors train 14 different models and perform a unified comparison.
* The design of the TokSuite benchmark is highly targeted. Instead of using standard, clean evaluations, it focuses on “perturbations” that specifically test tokenizer weaknesses, with particular attention to multilingual, math/STEM, and Unicode formatting scenarios.
* The p
* Most of the reported results only include the mean values, lacking variance and statistical significance tests. Given that many tokenizers show only small differences in performance, could these results be influenced by random factors?
* In this paper, the authors control the total number of training tokens to be consistent across models, but the total number of training texts varies significantly between datasets. Could this be a major factor contributing to performance differences, such as d
1. The authors constructed a high-quality TokSuite Benchmark.
2. The authors evaluated 14 different tokenizers and provided several insightful conclusions based on the experimental results.
1. In Table 1, the average (Avg.) results across the 14 tokenizers show little variation: except for TokenMonster and Tekken, the remaining 12 tokenizers perform similarly. This raises the question of whether the benchmark is sufficiently sensitive to differentiate between the effectiveness of different tokenizers.
2. The benchmark does not evaluate generation, translation, or code-related tasks. Since the impact of tokenization can vary significantly across task types, the applicability of the co
Code & Models
- 🤗 toksuite/google-gemma-2-2b · model · 375 dl
- 🤗 toksuite/common-pile-comma-v0.1 · model · 19 dl
- 🤗 toksuite/meta-llama-Llama-3.2-1B · model · 23 dl
- 🤗 toksuite/microsoft-Phi-3-mini-4k-instruct · model · 19 dl
- 🤗 toksuite/gpt2 · model · 371 dl
- 🤗 toksuite/bigscience-bloom · model · 124 dl
- 🤗 toksuite/facebook-xglm-564M · model · 128 dl
- 🤗 toksuite/mistralai-tekken · model · 12 dl
- 🤗 toksuite/google-byt5-small · model · 151 dl
- 🤗 toksuite/tokenmonster-englishcode-32000-consistent-v1 · model · 12 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
