Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

TL;DR
This paper demonstrates that larger language models benefit from larger vocabularies, and proposes methods to predict optimal vocabulary sizes that improve model performance and efficiency.
Contribution
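The paper introduces three approaches for determining the compute-optimal vocabulary size of an LLM (IsoFLOPs analysis, derivative estimation, and a parametric fit of the loss function) and shows that larger models need larger vocabularies to reach optimal performance.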
Findings
Optimal vocabulary size depends on compute budget.
Most LLMs use smaller vocabularies than optimal.
Increasing vocabulary size improves downstream task performance.
Abstract
Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions…
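The abstract names IsoFLOPs analysis without describing it. As a rough illustration of the general idea only (hold the compute budget fixed, vary the vocabulary size, and locate the loss minimum), the sketch below fits a parabola to loss against log vocabulary size and reads off the minimizer. This is not the authors' code: the function name `fit_isoflops_optimum` is hypothetical, the quadratic-in-log form is an assumption, and the loss values are synthetic numbers generated for the example, not results from the paper. The last line only checks the abstract's arithmetic for Llama2-70B.

```python
import numpy as np


def fit_isoflops_optimum(vocab_sizes, losses):
    """Fit loss = a*(log V)^2 + b*log V + c and return the V that minimizes it.

    Assumption for illustration: at a fixed FLOPs budget, loss is roughly
    quadratic in log(vocabulary size) near the optimum.
    """
    log_v = np.log(np.asarray(vocab_sizes, dtype=float))
    a, b, _c = np.polyfit(log_v, np.asarray(losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted curve is not convex; no interior minimum")
    # Vertex of the parabola in log space, mapped back to a vocabulary size.
    return float(np.exp(-b / (2 * a)))


# Synthetic losses (NOT from the paper), built so the minimum sits at V = 64,000.
true_opt = 64_000
vocab_sizes = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
losses = [2.5 + 0.02 * (np.log(v) - np.log(true_opt)) ** 2 for v in vocab_sizes]

estimated_opt = fit_isoflops_optimum(vocab_sizes, losses)
print(f"estimated optimal vocabulary size: {estimated_opt:,.0f}")

# Arithmetic from the abstract: predicted 216K versus the actual 32K vocabulary.
ratio = 216_000 / 32_000
print(f"predicted / actual vocabulary ratio: {ratio:.2f}")
```

With real measurements, each fixed compute budget would contribute its own set of (vocabulary size, loss) points, and the fitted minimizers could then be compared across budgets.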
Code & Models
- 🤗 sail/scaling-vocab-3b-32k-overtrain (model, 1 download)
- 🤗 sail/scaling-vocab-3b-43k-overtrain (model, 1 download)
- 🤗 RichardErkhov/sail_-_scaling-vocab-3b-32k-overtrain-exl2 (model)
- 🤗 RichardErkhov/sail_-_scaling-vocab-3b-32k-overtrain-4bits (model, 2 downloads)
- 🤗 RichardErkhov/sail_-_scaling-vocab-3b-32k-overtrain-8bits (model)
Taxonomy
Topics: Natural Language Processing Techniques
