Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung, Jeonghoon Kim

TL;DR
This paper investigates how increasing vocabulary size affects language model training, showing that larger vocabularies reduce text complexity and improve performance mainly by better modeling frequent words, which are crucial for downstream tasks.
Contribution
It provides a controlled study demonstrating that larger vocabularies lower tokenized text complexity and enhance frequent word modeling, clarifying the benefits of vocabulary scaling in language models.
Findings
Larger vocabularies reduce the Kolmogorov complexity of tokenized text.
Most of the benefit comes from better modeling of the 2,500 most frequent words.
Increasing the model's parameter count with a fixed vocabulary yields similar benefits.
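The first finding can be illustrated with a rough proxy. Kolmogorov complexity is not computable, so any measurement is an estimate; this summary does not say which estimator the authors use. The sketch below assumes a general-purpose compressor (zlib) as an upper-bound stand-in and reports compressed bits per character of the original text. The function name, the 4-byte packing of token IDs, and the choice of zlib are all assumptions made for illustration.

```python
import struct
import zlib


def compressed_bits_per_char(token_ids, n_chars):
    """Upper-bound proxy for the complexity of a tokenized text.

    token_ids: sequence of non-negative ints below 2**32 (hypothetical tokenizer output).
    n_chars:   length of the original text in characters, so that results
               from tokenizers with different vocabulary sizes are comparable.
    """
    # Serialize each token ID as a 4-byte little-endian unsigned int (assumption).
    raw = struct.pack(f"<{len(token_ids)}I", *token_ids)
    # zlib is a stand-in compressor, not necessarily the paper's estimator.
    compressed = zlib.compress(raw, 9)
    return 8 * len(compressed) / n_chars
```

To compare vocabularies under this proxy, one would tokenize the same text with each tokenizer and pass the resulting IDs together with the same `n_chars`; a lower value suggests a more compressible token stream.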
Abstract
Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text -- formalized via Kolmogorov complexity -- and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent…
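The word-level loss decomposition mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each word occurrence already has a negative log-likelihood (for example, the sum of the losses of its tokens), and it splits the mean per-word loss into the part contributed by the `top_k` most frequent words and the part contributed by everything else. The function name and input format are hypothetical.

```python
from collections import Counter


def decompose_loss_by_frequency(word_losses, top_k):
    """Split mean per-word loss into frequent-word and rare-word contributions.

    word_losses: list of (word, nll) pairs, one pair per word occurrence.
    top_k:       number of most frequent words treated as the "head".
    Returns (head, tail); head + tail equals the mean loss over all occurrences.
    """
    counts = Counter(word for word, _ in word_losses)
    frequent = {word for word, _ in counts.most_common(top_k)}

    head_total = sum(nll for word, nll in word_losses if word in frequent)
    tail_total = sum(nll for word, nll in word_losses if word not in frequent)

    n = len(word_losses)
    return head_total / n, tail_total / n
```

Running this for two models on the same evaluation text, with the word ranking fixed, and comparing the two `head` values against the two `tail` values would show where a loss reduction is concentrated. With `top_k=2500` this mirrors the split described in the abstract.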
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Machine Learning in Healthcare
