Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia
Zhejian Zhou, Jiayu Wang, Dahua Lin, Kai Chen

TL;DR
This paper investigates how different numeral tokenization systems affect the scaling, data efficiency, and numeric reasoning abilities of large language models, revealing base 10's advantages and the models' extrapolation behaviors.
Contribution
It provides empirical analysis of numeral system impacts on LLM performance, highlighting base 10's data efficiency and elucidating model extrapolation patterns in numeric reasoning.
Findings
Base 10 tokenization is more data-efficient than base 100 or 1000.
Different numeral systems have similar fine-tuning performance.
Models struggle with token-level discernment in larger bases.
Abstract
Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and this affects performance on numeric operations. Currently, there are two representative schemes: 1) tokenize into 1-digit tokens, and 2) tokenize into 1-to-3-digit tokens. The difference is roughly equivalent to using different numeral systems (namely base 10 or base 1000). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base 10 system is consistently more data-efficient than a base 100 or base 1000 system across training data scales and model sizes under from-scratch training settings, while different number…
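The correspondence between digit-chunk tokenization and numeral bases can be illustrated with a minimal sketch. The function name `tokenize_number` and the right-aligned chunking (grouping from the least significant digit) are assumptions made for illustration; the excerpt above does not say how the paper or any particular tokenizer aligns chunks.

```python
from typing import List


def tokenize_number(digits: str, digits_per_token: int) -> List[str]:
    """Split a decimal string into chunks of `digits_per_token` digits.

    Chunks are right-aligned (an assumption of this sketch), so only the
    leading chunk may be shorter than `digits_per_token`. Each full-width
    chunk acts as one "digit" in base 10 ** digits_per_token.
    """
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    if digits_per_token < 1:
        raise ValueError("digits_per_token must be at least 1")

    # Length of the (possibly shorter) leading chunk.
    head = len(digits) % digits_per_token
    chunks = [digits[:head]] if head else []
    # Remaining chunks are all exactly digits_per_token long.
    chunks += [
        digits[i:i + digits_per_token]
        for i in range(head, len(digits), digits_per_token)
    ]
    return chunks


for k in (1, 2, 3):
    print(f"base {10 ** k}:", tokenize_number("1234567", k))
# base 10: ['1', '2', '3', '4', '5', '6', '7']
# base 100: ['1', '23', '45', '67']
# base 1000: ['1', '234', '567']
```

Larger bases shorten the token sequence but enlarge the set of distinct number tokens: a full-width token can take any of 10 ** k values, so each individual token tends to be seen less often for the same amount of training data.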
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
