Estimating Numbers without Regression
Avijit Thawani, Jay Pujara, Ashwin Kalyan

TL;DR
This paper shows that simple tokenization schemes can substantially improve a language model's ability to estimate numbers, performing on par with more complex architectural modifications.
Contribution
It demonstrates that changing the vocabulary and tokenization scheme is simpler than changing the architecture, and comparably effective for number estimation in language models.
Findings
Tokenization-based methods perform on par with architectural changes.
Vocabulary modifications improve number estimation accuracy.
Simple tokenization schemes are effective for numerical fact estimation.
Abstract
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line, whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either (1) the notation in which numbers are written (e.g., scientific vs. decimal), (2) the vocabulary used to represent numbers, or (3) the entire architecture of the underlying language model, to directly regress to a desired number. Previous work suggests that architectural change helps achieve state-of-the-art results on number estimation, but we find an insightful ablation: changing the model's vocabulary instead (e.g.…
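To make the first two kinds of change in the abstract concrete, here is a minimal sketch of a notation rewrite and a vocabulary rewrite applied as text preprocessing. It is an illustration under stated assumptions, not the authors' implementation: the function names, the regular expression, the two-decimal mantissa, and the token names such as `[EXP3]` and `[NUM_ZERO]` are all hypothetical, and it only handles unsigned decimal numbers.

```python
import math
import re

# Assumption: numbers are unsigned decimals such as "1500" or "3.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def to_scientific(text: str) -> str:
    """Notation change: rewrite each decimal number in scientific notation."""
    return NUMBER_RE.sub(lambda m: f"{float(m.group()):.2e}", text)


def to_magnitude_tokens(text: str) -> str:
    """Vocabulary change: replace each number with a single token that
    encodes only its order of magnitude (hypothetical token names)."""

    def bucket(match: re.Match) -> str:
        value = float(match.group())
        if value == 0:
            return "[NUM_ZERO]"  # log10 is undefined at zero
        exponent = math.floor(math.log10(value))
        return f"[EXP{exponent}]"

    return NUMBER_RE.sub(bucket, text)


sentence = "The elephant weighs 1500 kg"
print(to_scientific(sentence))        # The elephant weighs 1.50e+03 kg
print(to_magnitude_tokens(sentence))  # The elephant weighs [EXP3] kg
```

Both rewrites leave the model itself untouched, which is why they are cheaper than an architectural change: a model predicting a masked number under the second scheme chooses among a small set of magnitude tokens instead of regressing to a real value.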
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
