How Much is Enough? The Diminishing Returns of Tokenization Training Data

Varshini Reddy; Craig W. Schmidt; Yuval Pinter; Chris Tanner

arXiv:2502.20273 · cs.CL · June 17, 2025

Open Access

TL;DR

This study shows that increasing tokenizer training data beyond a certain point yields minimal improvement in tokenization quality. Returns diminish beyond roughly 150GB for English and 200GB for Russian, a result that can inform more efficient tokenizer training.

Contribution

The paper identifies the diminishing returns of increasing tokenizer training data size and analyzes the resulting saturation across languages and tokenization algorithms.

Findings

01. Diminishing returns observed beyond 150GB for English

02. Diminishing returns observed beyond 200GB for Russian

03. Saturation linked to pre-tokenization constraints

Abstract

Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference strategy, and training data corpus. This paper investigates the impact of an often-overlooked hyperparameter, tokenizer training data size. We train BPE, UnigramLM, and WordPiece tokenizers across various vocabulary sizes using English training data ranging from 1GB to 900GB. Our findings reveal diminishing returns as training data size increases beyond roughly 150GB, suggesting a practical limit to the improvements in tokenization quality achievable through additional data. We analyze this phenomenon and attribute the saturation effect to constraints introduced by the pre-tokenization stage. We then demonstrate the extent to which these findings can generalize by experimenting on data in…
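
The kind of experiment the abstract describes (training a tokenizer on progressively larger data and checking when the result stops changing) can be illustrated at toy scale. The following is a minimal sketch, not the authors' setup: it assumes whitespace pre-tokenization, a simplified BPE trainer, and Jaccard overlap between successive vocabularies as the comparison measure. None of these choices is taken from the paper, and the names `train_bpe`, `vocab_from_merges`, `jaccard`, and `overlap_curve` are hypothetical.

```python
from collections import Counter


def pretokenize(text):
    """Whitespace pre-tokenization (an assumption): merges never cross these boundaries."""
    return text.split()


def train_bpe(text, num_merges):
    """Toy BPE trainer: returns the ordered list of learned merge pairs."""
    # Each distinct pre-token becomes a tuple of symbols with a frequency.
    words = Counter(tuple(w) for w in pretokenize(text))
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every pre-token is already a single symbol
        # Most frequent pair; ties broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges


def vocab_from_merges(merges):
    """Set of multi-character tokens created by the merges."""
    return {a + b for a, b in merges}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_curve(corpus_lines, sizes, num_merges):
    """Train on growing prefixes and compare each vocabulary with the previous one."""
    curve, prev = [], None
    for n in sizes:
        text = " ".join(corpus_lines[:n])
        vocab = vocab_from_merges(train_bpe(text, num_merges))
        if prev is not None:
            curve.append((n, jaccard(prev, vocab)))
        prev = vocab
    return curve


if __name__ == "__main__":
    # Tiny made-up corpus, purely for illustration.
    lines = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "a cat and a dog",
        "the cat and the dog sat",
    ] * 50
    for n, j in overlap_curve(lines, [1, 2, 4, 50, 200], num_merges=10):
        print(n, round(j, 3))
```

In this sketch, an overlap that approaches 1.0 as the prefix grows would indicate that additional data no longer changes the learned vocabulary. Because merges cannot cross pre-token boundaries, the set of candidate tokens is bounded by the pre-tokens seen in training, which is one simple way to picture a pre-tokenization constraint.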

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Authorship Attribution and Profiling

Methods: WordPiece · Byte Pair Encoding