TL;DR
This paper introduces a method for selecting vocabulary size in NLP and sequence modeling by analyzing token frequency distributions through Zipf's law, showing that adherence to Zipfian scaling enhances model performance across multiple domains.
Contribution
The authors propose a principled approach to determine vocabulary size based on Zipf's law, linking token distribution properties to downstream task performance.
Findings
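Downstream task performance correlates with how closely the token frequency distribution follows a power law, and models peak when it closely adheres to Zipf's law.
Choosing a vocabulary size that aligns the token distribution with Zipfian scaling improves both model efficiency and effectiveness.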
The approach generalizes across NLP, genomics, and chemistry.
Abstract
Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a…