Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law

Yanjin He; Qingkai Zeng; Meng Jiang

arXiv:2507.22543 · cs.LG · July 31, 2025

TL;DR

This paper introduces a method for selecting vocabulary size in NLP and sequence modeling by analyzing token frequency distributions through Zipf's law, showing that adherence to Zipfian scaling enhances model performance across multiple domains.

Contribution

The authors propose a principled approach to determine vocabulary size based on Zipf's law, linking token distribution properties to downstream task performance.

Findings

01 Models perform best when token distributions follow Zipf's law.

02 Aligning vocabulary size with Zipfian scaling improves efficiency and effectiveness.

03 The approach generalizes across NLP, genomics, and chemistry.

Abstract

Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a…
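
The abstract as excerpted here does not say how closeness to Zipf's law is measured. As a rough illustration only, the sketch below assumes one common choice: fit a straight line to log frequency against log rank and read off the exponent and the R² of the fit. Under Zipf's law the frequency of the token at rank r is roughly proportional to 1 / r^alpha, so the log-log plot is close to linear. The function name `zipf_fit` and the use of R² as the score are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Least-squares fit of log(frequency) = intercept - alpha * log(rank).

    Returns (alpha, r_squared). An r_squared near 1 means the rank-frequency
    curve is close to a power law. This is an illustrative metric (an
    assumption), not necessarily the one used in the paper.
    """
    # Token counts sorted from most to least frequent; rank 1 is the top token.
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # If every token has the same count, R^2 is undefined; report 0.0 here
    # by convention, since there is no power-law trend to speak of.
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return -slope, r_squared


if __name__ == "__main__":
    # Synthetic corpus whose counts are exactly 120 / rank for ranks 1..5.
    corpus = []
    for rank in range(1, 6):
        corpus += [f"tok{rank}"] * (120 // rank)
    alpha, r2 = zipf_fit(corpus)
    print(f"alpha={alpha:.3f}, R^2={r2:.3f}")
```

To compare candidate vocabulary sizes with a score like this, one could tokenize the same corpus with each candidate vocabulary, call `zipf_fit` on each resulting token stream, and prefer the size with the highest R². That selection rule is likewise a hedged reading of the abstract, not a description of the paper's procedure.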
