How Much is Enough? The Diminishing Returns of Tokenization Training Data
Varshini Reddy, Craig W. Schmidt, Yuval Pinter, Chris Tanner

TL;DR
Increasing tokenizer training data beyond a certain point yields minimal improvement in tokenization quality. The study reports a saturation point of around 150GB for English and 200GB for Russian, a finding that can inform more efficient tokenizer training.
Contribution
The paper identifies diminishing returns in tokenizer training data size and analyzes the saturation effect across languages and tokenization algorithms.
Findings
Diminishing returns observed beyond 150GB for English
Diminishing returns observed beyond 200GB for Russian
Saturation linked to pre-tokenization constraints
Abstract
Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference strategy, and training data corpus. This paper investigates the impact of an often-overlooked hyperparameter, tokenizer training data size. We train BPE, UnigramLM, and WordPiece tokenizers across various vocabulary sizes using English training data ranging from 1GB to 900GB. Our findings reveal diminishing returns as training data size increases beyond roughly 150GB, suggesting a practical limit to the improvements in tokenization quality achievable through additional data. We analyze this phenomenon and attribute the saturation effect to constraints introduced by the pre-tokenization stage. We then demonstrate the extent to which these findings can generalize by experimenting on data in…
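To make the saturation idea concrete, here is a minimal sketch of one way to probe it: train a toy BPE on growing prefixes of a corpus and compare each resulting vocabulary with the one learned from the full corpus. This is not the paper's experimental setup or metric. Whitespace pre-tokenization, the character-level BPE, the Jaccard overlap measure, and all function names (`pretokenize`, `train_bpe`, `overlap_with_full`) are assumptions made for illustration. Because merges never cross pre-token boundaries, the learned vocabulary can only contain substrings of pre-tokens, which is one plausible reading of how pre-tokenization could bound what extra data can add.

```python
from collections import Counter


def pretokenize(text):
    # Assumption: plain whitespace splitting stands in for the pre-tokenizer.
    return text.split()


def train_bpe(word_counts, num_merges):
    """Toy character-level BPE; merges never cross pre-token boundaries."""
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every pre-token is already a single symbol
        # Deterministic tie-break: highest count, then smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        new_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges


def jaccard(a, b):
    # Overlap of two vocabularies; two empty sets count as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_with_full(text, fractions, num_merges):
    """For each prefix fraction, compare its merged-token set to the full corpus's."""
    tokens = pretokenize(text)
    full = {a + b for a, b in train_bpe(Counter(tokens), num_merges)}
    results = []
    for f in fractions:
        n = max(1, int(len(tokens) * f))
        part = {a + b for a, b in train_bpe(Counter(tokens[:n]), num_merges)}
        results.append((f, jaccard(part, full)))
    return results
```

Calling `overlap_with_full(text, [0.1, 0.25, 0.5, 1.0], num_merges)` on a real corpus returns one overlap score per prefix. If the scores flatten out well before the full corpus is used, that is the toy analogue of the diminishing returns described above. Real experiments would use production tokenizer libraries and the paper's own evaluation measures.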
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Authorship Attribution and Profiling
Methods: WordPiece · Byte Pair Encoding
