Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia

Zhejian Zhou; Jiayu Wang; Dahua Lin; Kai Chen

arXiv:2409.17391·cs.CL·September 30, 2024

TL;DR

This paper investigates how the numeral system implied by number tokenization (base 10 versus base 100 or 1000) affects the scaling, data efficiency, and numeric reasoning of large language models. It finds base 10 more data-efficient when training from scratch and characterizes the models' extrapolation behavior.

Contribution

The paper provides an empirical analysis of how the choice of numeral system affects LLM performance, highlighting the data efficiency of base 10 and describing how models extrapolate in numeric reasoning.

Findings

01. Base 10 tokenization is more data-efficient than base 100 or 1000.

02. Different numeral systems have similar fine-tuning performance.

03. Models struggle with token-level discernment in larger bases.

Abstract

Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and the choice affects performance on numeric operations. Currently, there are two representative schemes: 1) tokenizing into $1$-digit tokens, and 2) tokenizing into $1 \sim 3$-digit tokens. The difference is roughly equivalent to using different numeral systems (namely base $10$ or base $10^{3}$). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training settings, while different number…
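The correspondence between digit grouping and numeral base can be illustrated with a minimal Python sketch. It is not the paper's tokenizer: the function names `chunk_digits` and `tokens_to_int` are hypothetical, and the right-aligned grouping (like thousands separators) is an assumption made so that each token lines up with one digit of a base $10^{k}$ number.

```python
def chunk_digits(number: str, digits_per_token: int) -> list[str]:
    """Split a decimal string into tokens of at most `digits_per_token` digits.

    Assumption: groups are right-aligned, so only the leading token may be
    shorter. Real tokenizers may group differently.
    """
    if digits_per_token < 1:
        raise ValueError("digits_per_token must be at least 1")
    if not (number.isascii() and number.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    head = len(number) % digits_per_token  # length of the short leading token
    tokens = [number[:head]] if head else []
    tokens += [
        number[i:i + digits_per_token]
        for i in range(head, len(number), digits_per_token)
    ]
    return tokens


def tokens_to_int(tokens: list[str], digits_per_token: int) -> int:
    """Read the tokens as digits of a base 10**digits_per_token number."""
    base = 10 ** digits_per_token
    value = 0
    for token in tokens:
        value = value * base + int(token)
    return value


print(chunk_digits("1234567", 1))  # ['1', '2', '3', '4', '5', '6', '7']
print(chunk_digits("1234567", 3))  # ['1', '234', '567']
print(tokens_to_int(chunk_digits("1234567", 3), 3))  # 1234567
```

With one digit per token the vocabulary needs only 10 number tokens but sequences are longer; with up to three digits per token the sequence is shorter, but the model must tell apart on the order of a thousand distinct number tokens.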

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection