Learning the greatest common divisor: explaining transformer predictions
François Charton

TL;DR
This paper investigates how small transformers learn to compute the GCD of two integers. It shows that their predictions rely on learned divisibility patterns, and that the training data distribution affects both performance and interpretability.
Contribution
It provides a detailed characterization of transformer behavior in GCD computation and shows how training data distribution influences both accuracy and explainability.
Findings
Transformers predict the largest element of a learned list of integers (products of divisors of the base and small primes) that divides both operands.
Training with uniform operands lets the model learn only a handful of GCDs.
Log-uniform training distributions significantly improve GCD prediction accuracy.
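To make the difference between the two operand distributions concrete, here is a minimal Python sketch. It assumes that "log-uniform" means the logarithm of the operand is drawn uniformly, so that an integer k appears with probability roughly proportional to 1/k. The function names, the seed, and the bound of 10^6 are illustrative choices, not taken from the paper.

```python
import math
import random


def sample_uniform(rng, max_value):
    """Draw an operand uniformly from 1..max_value."""
    return rng.randint(1, max_value)


def sample_log_uniform(rng, max_value):
    """Draw an operand whose logarithm is (approximately) uniform.

    Assumption: exp(u) with u uniform on [0, log(max_value + 1)] is floored
    to an integer, which gives P(k) roughly proportional to 1/k.
    """
    x = math.exp(rng.uniform(0.0, math.log(max_value + 1)))
    # Clamp to guard against the endpoint and floating-point rounding.
    return min(max_value, max(1, int(x)))


def fraction_at_most(sampler, threshold, n_samples, max_value, seed=0):
    """Fraction of sampled operands that are <= threshold."""
    rng = random.Random(seed)
    hits = sum(sampler(rng, max_value) <= threshold for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    for name, sampler in [("uniform", sample_uniform),
                          ("log-uniform", sample_log_uniform)]:
        share = fraction_at_most(sampler, 1000, 20000, 10**6)
        print(f"{name}: share of operands <= 1000 = {share:.3f}")
```

Under these assumptions, about half of the log-uniform operands fall below 1000, against roughly one in a thousand for uniform operands, so a model trained on log-uniform operands sees many more small inputs.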
Abstract
The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list D of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of D that divides both inputs. Training distributions impact performance. Models trained from uniform operands only learn a handful of GCD (up to 38 GCD ≤ 100). Log-uniform operands boost performance to 73 GCD ≤ 100, and a log-uniform distribution of outcomes (i.e. GCD) to 91. However, training from uniform (balanced) GCD breaks explainability.
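The prediction rule described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's code: `learned_list` and `predicted_gcd` are hypothetical names, and the list is built by closing the divisors of the base and a chosen set of small primes under multiplication up to an arbitrary `limit`. The list a trained model actually learns depends on training and need not be closed in this way.

```python
def learned_list(base, small_primes, limit=100):
    """Hypothetical learned list: products of divisors of `base` and of
    `small_primes`, capped at `limit` (an illustrative simplification)."""
    generators = {d for d in range(2, base + 1) if base % d == 0}
    generators |= set(small_primes)
    learned = {1}
    frontier = [1]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = x * g
            if y <= limit and y not in learned:
                learned.add(y)
                frontier.append(y)
    return sorted(learned)


def predicted_gcd(a, b, learned):
    """Largest element of the learned list that divides both inputs.

    The list always contains 1, so the maximum is well defined.
    """
    return max(d for d in learned if a % d == 0 and b % d == 0)


if __name__ == "__main__":
    base10_only = learned_list(10, small_primes=[])
    with_three = learned_list(10, small_primes=[3])
    print(predicted_gcd(40, 60, base10_only))  # true GCD is 20
    print(predicted_gcd(12, 18, base10_only))  # true GCD is 6
    print(predicted_gcd(12, 18, with_three))   # true GCD is 6
```

In this sketch, with base 10 and no extra primes, the pair (12, 18) is predicted as 2 rather than 6, because 3 is not in the list. Adding 3 to the small primes makes the prediction match the true GCD.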
Peer Reviews
Decision: ICLR 2024 spotlight
1. The main idea is pretty straightforward and most of the concepts in the paper are well explained (see concerns discussed below).
2. To my knowledge, computing and analyzing GCD prediction via transformers has not been done before.
3. Very comprehensive suite of experiments and exhaustive analysis. I appreciate the authors performing such a wide range of experiments. I also really like the nice link between theoretical accuracy and practical accuracy provided in Appendix C.
4. Claims made by
My main concern about this work lies with the significance of the results and observations made. The authors train a transformer to predict the GCD, which seems to work fairly well with some tricks in picking the right dataset. However, I’m not convinced about why this result would be of significant interest to the wider community and what it says about the representational power of transformers themselves beyond the narrow context of learning GCDs. Like I mentioned earlier, the experiments are th
I really enjoyed reading this paper. In most cases, the algorithms learned by Transformers do not seem interpretable. I was surprised by how structured the encoded algorithm is in the case of GCD computation. This structure is very well explored in the paper through cleverly designed experiments and the intuition behind the results is also explained well.
I don't see any substantial weakness. Perhaps one could say that the implications of these results for large language models are unclear. However, we barely know anything about the internal workings of Transformer models (the architecture behind LLMs) and this paper makes a small (and interesting!) step in enhancing our understanding.
- The particular attention to the distribution of the operands is important, and it is shown that emphasizing small numbers increases the performance of the model (resembling curriculum learning).
- Similarly, the authors have considered the distribution of the output, showing that large GCDs may be slow or hard to learn because they are rare.
- Experiments are rather extensive along several axes (e.g., number of bases, size of the models, batch size, ...).
- The claim that Transformer predictions are fully explainable does not seem to be accurate once log-uniform operands are used.
- Although the paper usually interprets the results of experiments, it does not put forward any explanation for such results (for example, defining and justifying the shortcuts Transformers may take).
- Some of the reported rules might be consequences of other factors (and they may not be robust as a result). For example, the smaller primes are learned (or grokked) fas
Taxonomy
Topics: Rings, Modules, and Algebras
Methods: Balanced Selection
