The Foundations of Tokenization: Statistical and Computational Concerns
Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, Ryan Cotterell

TL;DR
This paper develops a formal framework for understanding tokenization in NLP, addressing its theoretical foundations, statistical properties, and computational concerns to improve language model robustness.
Contribution
It introduces a unified formal framework based on stochastic maps to analyze tokenizers and establish conditions for their consistency and reliability.
Findings
Provides necessary and sufficient conditions for tokenizer consistency
Analyzes statistical and computational issues like ambiguity and finiteness
Lays groundwork for more robust neural language representations
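To make the ambiguity and consistency findings above concrete, here is a minimal sketch that treats a tokenizer as a pair of maps: a stochastic encoder from strings to distributions over token sequences, and a deterministic decoder back to strings. The toy vocabulary, the uniform encoder, and every function name are assumptions made for illustration; none of them come from the paper. The round-trip check is likewise one simple sanity condition, not the paper's formal consistency theorem.

```python
from collections import defaultdict

# Hypothetical toy vocabulary, chosen so that "ab" has two segmentations.
VOCAB = ("a", "b", "ab")


def segmentations(text, vocab=VOCAB):
    """Return every token sequence over `vocab` that concatenates to `text`."""
    if text == "":
        return [()]
    results = []
    for token in vocab:
        if text.startswith(token):
            for rest in segmentations(text[len(token):], vocab):
                results.append((token,) + rest)
    return results


def encode(text, vocab=VOCAB):
    """Stochastic encoder (assumed uniform over all valid segmentations)."""
    segs = segmentations(text, vocab)
    if not segs:
        raise ValueError(f"{text!r} cannot be tokenized with this vocabulary")
    return {seg: 1.0 / len(segs) for seg in segs}


def decode(tokens):
    """Deterministic decoder: concatenate the tokens."""
    return "".join(tokens)


def round_trip(p_text, vocab=VOCAB):
    """Push a distribution over strings through the encoder, then the decoder."""
    out = defaultdict(float)
    for text, p in p_text.items():
        for seg, q in encode(text, vocab).items():
            out[decode(seg)] += p * q
    return dict(out)


# Spurious ambiguity: two distinct token sequences decode to the same string.
print(segmentations("ab"))  # [('a', 'b'), ('ab',)]

# Round trip on a toy distribution over strings.
print(round_trip({"ab": 0.5, "b": 0.5}))
```

In this sketch the round trip returns the original distribution because the decoder exactly inverts every segmentation the encoder can produce. A decoder that normalized or dropped characters would break that property, which is the kind of failure a formal consistency condition is meant to rule out.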
Abstract
Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on language model estimation has been investigated primarily through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a…
Peer Reviews
Decision·ICLR 2025 Poster
The paper has several strengths:
- Tokenization is a critical aspect of modern-day natural language processing, but its theoretical underpinnings are not yet fully understood. The formalisms introduced in the paper help close this gap and might become the basis for future work.
- The application of stochastic maps to tokenization is novel.
- The presentation is excellent; the writing is clear and overall easy to follow.
This is a completely theoretical paper without any empirical evaluation. While not a weakness _per se_, the authors mention that their findings have implications for the practical use of tokenizers (e.g., line 95). Compared to the version of the paper that was under submission at NeurIPS 2024, the authors have added sections discussing practical aspects of their observations (e.g., lines 345-350), but I still believe that a proper case study showcasing the practical value of the proposed formal
The paper presents a novel framework for representing and analyzing tokenizers in the form of stochastic maps. The proposed framework has the potential to provide a more foundational statistical understanding of tokenizer behavior, which may lead to practical improvements.
The paper provides some examples of how the framework can be utilized to shed new light on tokenizer behavior such as ambiguity and inconsistency, but it is not immediately clear to what extent the proposed framework can lead to practical improvements of tokenizer performance.
This is a theoretical position paper written from a foundational perspective: it introduces a mathematically justified analysis and presents strict definitions and notation conventions.
It is still unclear how to apply this knowledge to practical issues or to build better tokenizers for LLMs.
Taxonomy
Topics: Genomics and Chromatin Dynamics
