Toward a Theory of Tokenization in LLMs
Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

TL;DR
This paper provides a theoretical investigation into how tokenization enhances transformer models' ability to learn complex sequence distributions, demonstrating that proper tokenization allows near-optimal modeling of Markovian data.
Contribution
It offers a theoretical analysis showing that tokenization enables transformers to effectively model higher-order Markov processes, which helps explain why tokenization matters in practice.
Findings
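Without tokenization, transformers trained on certain simple higher-order Markov processes tend to predict characters according to a unigram model.
With tokenization, transformers can model the probabilities of sequences drawn from these sources near-optimally.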
Tokenization significantly improves the modeling of Markovian data.
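The gap described in these findings can be illustrated with a small exact calculation. The sketch below is not the paper's construction. It assumes a binary first-order Markov chain (a simplification of the higher-order sources discussed here) and a hypothetical fixed-length block tokenizer that treats every run of m characters as one token. All function names are illustrative. It compares the per-character cross-entropy of the best unigram model over such tokens with the entropy rate of the source, which is the optimal achievable loss.

```python
import itertools
import math


def stationary(p, q):
    # Binary chain with P(0 -> 1) = p and P(1 -> 0) = q.
    # Returns the stationary distribution (pi0, pi1).
    return (q / (p + q), p / (p + q))


def h2(x):
    # Binary entropy in bits.
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def entropy_rate(p, q):
    # Entropy rate of the stationary chain in bits per character:
    # the lowest per-character cross-entropy any model can reach.
    pi0, pi1 = stationary(p, q)
    return pi0 * h2(p) + pi1 * h2(q)


def block_prob(block, p, q):
    # Probability of a character block under the stationary chain.
    pi = stationary(p, q)
    trans = {(0, 0): 1 - p, (0, 1): p, (1, 0): q, (1, 1): 1 - q}
    prob = pi[block[0]]
    for a, b in zip(block, block[1:]):
        prob *= trans[(a, b)]
    return prob


def unigram_loss_per_char(m, p, q):
    # Assumption: tokens are all length-m blocks (hypothetical tokenizer).
    # The best unigram model over tokens matches the block marginal, so its
    # cross-entropy per token is the block entropy; divide by m for per-char.
    total = 0.0
    for block in itertools.product((0, 1), repeat=m):
        pr = block_prob(block, p, q)
        if pr > 0:
            total -= pr * math.log2(pr)
    return total / m


if __name__ == "__main__":
    p = q = 0.1  # illustrative switching probabilities
    print("entropy rate:", entropy_rate(p, q))
    for m in (1, 2, 4, 8):
        print("block length", m, "unigram loss per char:",
              unigram_loss_per_char(m, p, q))
```

With m = 1 the token-level unigram model is just a character-level unigram model, which ignores the dependence between neighboring characters. As m grows, each token captures more of that dependence, and the per-character loss moves toward the entropy rate. Under these assumptions, this is the intuition behind a unigram model over tokens being near-optimal.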
Abstract
While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple k-th-order Markov processes for k > 1, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
