Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Yaniv Nikankin, Anja Reusch, Aaron Mueller, Yonatan Belinkov

TL;DR
Large language models solve arithmetic problems primarily through a set of simple heuristics implemented by specific neurons, rather than learning robust algorithms or memorizing data, as shown by causal analysis and neuron-level examination.
Contribution
This paper reveals that LLMs use a 'bag of heuristics' mechanism for arithmetic, identified through causal neuron analysis, challenging the idea of algorithmic learning or memorization.
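One common form of causal neuron analysis is activation patching, the technique the peer reviews below name. The following is a minimal sketch on a hand-built toy network, not the authors' setup: the weights, the two prompts, and the function names (`hidden`, `output`, `patch_effects`) are all hypothetical and chosen only so the effect of each neuron is easy to read off.

```python
import numpy as np

def hidden(x, W_in):
    # Toy one-layer ReLU "MLP" activations (assumption: no biases).
    return np.maximum(0.0, W_in @ x)

def output(h, W_out):
    # Linear read-out from hidden activations to answer logits.
    return W_out @ h

def patch_effects(x_clean, x_cf, W_in, W_out, answer_idx):
    """For each hidden neuron, replace its clean activation with the
    activation from a counterfactual prompt and record how much the
    logit of the clean answer changes (absolute difference)."""
    h_clean = hidden(x_clean, W_in)
    h_cf = hidden(x_cf, W_in)
    base = output(h_clean, W_out)[answer_idx]
    effects = np.zeros(len(h_clean))
    for i in range(len(h_clean)):
        h_patched = h_clean.copy()
        h_patched[i] = h_cf[i]  # patch a single neuron
        effects[i] = abs(base - output(h_patched, W_out)[answer_idx])
    return effects

# Hypothetical weights: only hidden neuron 2 feeds the answer logit 0.
W_in = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_out = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
x_clean = np.array([2.0, 3.0])  # stands in for the clean prompt
x_cf = np.array([5.0, 1.0])     # stands in for a counterfactual prompt

effects = patch_effects(x_clean, x_cf, W_in, W_out, answer_idx=0)
most_important = int(np.argmax(effects))
```

Neurons whose patching changes the answer logit the most are candidates for the circuit; in this toy case only one neuron has a nonzero effect by construction.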
Findings
LLMs rely on heuristic neurons for arithmetic reasoning.
A sparse set of neurons implements simple input-pattern heuristics.
The heuristic mechanism is prominent early in training.
Abstract
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a representative task. Using causal analysis, we identify a subset of the model (a circuit) that explains most of the model's behavior for basic arithmetic logic and examine its functionality. By zooming in on the level of individual circuit neurons, we discover a sparse set of important neurons that implement simple heuristics. Each heuristic identifies a numerical input pattern and outputs corresponding answers. We hypothesize that the combination of these heuristic neurons is the mechanism used to produce correct arithmetic answers. To test this, we categorize each neuron into several heuristic types, such as neurons that activate when an operand falls within a certain range, and find that the…
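The idea that many simple pattern detectors can jointly produce a correct answer can be illustrated with a toy sketch. This is an assumption-laden illustration, not the paper's model: the `Heuristic` class, the single-digit operand range, the vote-counting rule, and the particular modulo heuristics are all hypothetical; only the operand-range heuristic type is named in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

MAX_OPERAND = 9                       # toy assumption: single-digit operands
ANSWERS = range(2 * MAX_OPERAND + 1)  # candidate sums 0..18

@dataclass(frozen=True)
class Heuristic:
    name: str
    fires: Callable[[int, int], bool]  # input pattern on (op1, op2)
    boosts: FrozenSet[int]             # answers promoted when it fires

def range_heuristics(width: int) -> List[Heuristic]:
    # "op1 falls within [lo, hi]" promotes every sum reachable from that range.
    out = []
    for lo in range(0, MAX_OPERAND + 1, width):
        hi = min(lo + width - 1, MAX_OPERAND)
        out.append(Heuristic(
            name=f"op1 in [{lo},{hi}]",
            fires=lambda a, b, lo=lo, hi=hi: lo <= a <= hi,
            boosts=frozenset(range(lo, hi + MAX_OPERAND + 1)),
        ))
    return out

def modulo_heuristics(m: int) -> List[Heuristic]:
    # Hypothetical family: a residue pattern on both operands promotes
    # the answers with the matching residue.
    out = []
    for ra in range(m):
        for rb in range(m):
            out.append(Heuristic(
                name=f"op1%{m}=={ra} & op2%{m}=={rb}",
                fires=lambda a, b, ra=ra, rb=rb: a % m == ra and b % m == rb,
                boosts=frozenset(x for x in ANSWERS if x % m == (ra + rb) % m),
            ))
    return out

def predict(a: int, b: int, bag: List[Heuristic]) -> int:
    # Each firing heuristic adds one vote to every answer it boosts.
    votes = {x: 0 for x in ANSWERS}
    for h in bag:
        if h.fires(a, b):
            for x in h.boosts:
                votes[x] += 1
    return max(ANSWERS, key=lambda x: votes[x])  # ties go to the smallest answer

def accuracy(bag: List[Heuristic]) -> float:
    pairs = [(a, b) for a in range(MAX_OPERAND + 1) for b in range(MAX_OPERAND + 1)]
    return sum(predict(a, b, bag) == a + b for a, b in pairs) / len(pairs)

full_bag = (range_heuristics(5) + modulo_heuristics(2)
            + modulo_heuristics(3) + modulo_heuristics(5))
# Ablate one family to mimic knocking out a group of heuristic neurons.
ablated_bag = [h for h in full_bag if "%5" not in h.name]
```

In this toy setting no single heuristic determines the sum, yet the full bag recovers it for every operand pair, while removing one family leaves some sums ambiguous and lowers accuracy. That mirrors, in a very simplified way, the hypothesis that the combination of heuristics carries the arithmetic.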
Peer Reviews
Decision: ICLR 2025 Poster
1. Arithmetic tasks are a particularly useful test bed to understand the learning mechanism of LLMs. However, it has been an open question as to what algorithm the model is truly learning. This work makes an important step in this direction, particularly since it applies to the newest edition of LLMs. 2. The bag of heuristics finding explains the lack of length generalization in most LLMs on arithmetic tasks. 3. The experiments and analysis are extensive and the authors make an effort to validat
1. A lot of the analysis assumes that all tokenizers tokenize numbers as a single token up to some limit. While, interestingly, I found this to be true for llama3 and pythia, I don't think this is always true. In fact, the llama2 tokenizer itself was different in that it tokenized each digit individually. I would like to see if these findings are tokenizer-specific. 2. The sampling mechanism of activation patching seems to be fairly "high-variance". Sampling a "random counterfactual promp
This paper applies causal analysis and activation patching experiments to identify individual neuron behaviors. The authors' step-by-step examination, from identifying neurons and classifying them into heuristic types to analyzing their evolution over training time, offers a thorough perspective on LLMs' arithmetic reasoning mechanisms. The paper may facilitate future studies on generalization abilities.
This paper focuses on arithmetic calculation but mentions general reasoning at the beginning. I feel that this could be an overclaim because simple one-step calculation is far from reasoning. The authors draw all conclusions from pre-trained checkpoints. Since models are typically fine-tuned before use, it would be better to have a similar analysis on arithmetically or generally fine-tuned LLMs. The focus is primarily on arithmetic reasoning, which may not fully generalize to other c
- For the most part the paper was well-written and well-motivated. At any given point in the paper, it was easy to understand (1) what the hypothesis being tested is, (2) why the authors are testing this hypothesis, (3) the experimental setup, and (4) the results. - The experiments conducted by the authors are thorough and strongly support the claims made throughout the paper. - The topic of the paper is timely and illuminates an interesting aspect of arithmetic reasoning in language models that has
- The title of the paper is slightly misleading: “Models Solve Math with a Bag of Heuristics”. In reality, the authors only examine arithmetic where both operands have fewer than three digits. Maybe for a greater number of digits, these heuristics are somehow algorithmically combined? This question is left open in the paper (which is fine, I think the paper is a good contribution as is), but the title does not reflect this. - At the very least, I think this point should be mentioned briefly in th
Taxonomy
Topics: Computability, Logic, AI Algorithms
Methods: Sparse Evolutionary Training
