Exploring Internal Numeracy in Language Models: A Case Study on ALBERT
Ulme Wennberg, Gustav Eje Henter

TL;DR
This study investigates how ALBERT language models internally represent numerical concepts. Applying PCA to the learned embeddings of number tokens shows that the models encode the approximate ordering of numbers.
Contribution
The paper introduces a method to analyze internal numerical representations in language models and demonstrates ALBERT's ability to encode numerical orderings.
Findings
ALBERT models encode the approximate ordering of numerical concepts along the axes of greatest variation (the leading principal components).
Numerals and textual numbers form separate but similarly oriented clusters.
Language models can implicitly learn basic mathematical concepts.
Abstract
It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model…
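The analysis described in the abstract (collect the embeddings of number tokens, then apply PCA) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the embeddings are synthetic stand-ins generated by the hypothetical helper `make_synthetic_embeddings`, which assumes that magnitude is encoded along one hidden direction on a log scale. That assumption, the dimension of 128, and the noise level are all arbitrary choices made for the example. With a real model, the rows of the model's input embedding matrix for the relevant tokens would take the place of the synthetic array.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project rows onto the top principal components via SVD."""
    # PCA requires mean-centered data.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


def make_synthetic_embeddings(values, dim=128, noise=0.05, seed=0):
    """Hypothetical stand-in for learned embeddings of number tokens.

    Assumption: magnitude lies along a single random direction (log scale),
    with Gaussian noise in every dimension.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    scaled = np.log1p(np.asarray(values, dtype=float))
    return np.outer(scaled, direction) + noise * rng.normal(size=(len(values), dim))


values = list(range(1, 31))
embeddings = make_synthetic_embeddings(values)
projected, explained = pca_project(embeddings)

# The sign of a principal axis is arbitrary, so compare the absolute correlation.
corr = abs(np.corrcoef(projected[:, 0], values)[0, 1])
print(f"variance explained by PC1 and PC2: {explained}")
print(f"|correlation| between PC1 and numeric value: {corr:.3f}")
```

A high absolute correlation between the first principal component and the numeric values is the kind of pattern the paper reports for ALBERT. Here it follows by construction from the synthetic data, so the sketch only illustrates the procedure and says nothing about any real model.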
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Multi-Head Attention · Dense Connections · Residual Connection · Principal Components Analysis · Softmax
