A path to natural language through tokenisation and transformers
David S. Berman, Alexander G. Stapleton

TL;DR
This paper investigates how byte-pair encoding (BPE) tokenisation shapes the statistical properties of the text that language models see. It shows that deeper BPE levels push token frequencies toward Zipf's law and bring model entropy in line with the Zipfian prediction, clarifying BPE's role in language representation.
Contribution
It provides a theoretical and empirical analysis of BPE's effect on language statistics, linking tokenisation depth to natural language regularities and model entropy.
Findings
Deeper BPE levels induce Zipfian distribution in token frequencies.
Model entropy predictions align with Zipf's law as BPE depth increases.
Deeper tokenisation reduces local token dependencies, approaching IID conditions.
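The findings above can be probed on a small scale. The following is a minimal sketch, not the authors' code: it applies BPE merges to a toy string and reports the unigram Shannon entropy and a fitted log-log rank-frequency slope after each merge. Treating one merge as one "level", the toy corpus, and the function names `merge_most_frequent_pair`, `unigram_entropy` and `zipf_slope` are all assumptions made for illustration.

```python
import math
from collections import Counter


def merge_most_frequent_pair(tokens):
    """One BPE step: fuse every occurrence of the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens)
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # replace the pair with a single new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 corresponds to the classic Zipf law f(r) ~ 1/r.
    Returns NaN when fewer than two distinct tokens exist.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return float("nan")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


if __name__ == "__main__":
    # Hypothetical toy corpus; a real experiment would use a large text corpus.
    corpus = "the cat sat on the mat and the dog sat on the log " * 20
    tokens = list(corpus)  # level 0: individual characters
    for level in range(0, 31):
        if level % 10 == 0:
            print(level, len(set(tokens)),
                  round(unigram_entropy(tokens), 3),
                  round(zipf_slope(tokens), 3))
        tokens = merge_most_frequent_pair(tokens)
```

On a corpus this small the fitted slope is noisy, so the numbers only illustrate the measurement procedure; they say nothing about whether the paper's reported trends hold.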
Abstract
Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context dependent token probability distributions, we train language…
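For orientation, the Shannon entropy of a finite Zipfian distribution can be written down directly. This is the standard textbook result and is not claimed to be the closed-form slot entropy expression derived in the paper; the vocabulary size N and exponent s are symbols assumed here for illustration.

```latex
% Zipfian probabilities over ranks r = 1, ..., N with exponent s
p_r = \frac{r^{-s}}{H_{N,s}}, \qquad H_{N,s} = \sum_{k=1}^{N} k^{-s}

% Shannon entropy (in nats), using -\ln p_r = s \ln r + \ln H_{N,s}
S = -\sum_{r=1}^{N} p_r \ln p_r
  = \frac{s}{H_{N,s}} \sum_{r=1}^{N} \frac{\ln r}{r^{s}} + \ln H_{N,s}
```

The first term grows with the weight carried by the high-rank tail, and the second is the log of the normalisation constant, so for fixed s the entropy increases as the vocabulary size N grows.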
Taxonomy
Topics: Language and cultural evolution · Authorship Attribution and Profiling · Natural Language Processing Techniques
