Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics

Yaniv Nikankin; Anja Reusch; Aaron Mueller; Yonatan Belinkov

arXiv:2410.21272·cs.CL·May 21, 2025·3 cites

TL;DR

Large language models solve arithmetic problems primarily through a set of simple heuristics implemented by specific neurons, rather than learning robust algorithms or memorizing data, as shown by causal analysis and neuron-level examination.

Contribution

This paper reveals that LLMs use a 'bag of heuristics' mechanism for arithmetic, identified through causal neuron analysis, challenging the idea of algorithmic learning or memorization.
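The causal neuron analysis mentioned here is described in the reviews as activation patching: run the model on a clean prompt and on a counterfactual prompt, copy one component's activation from the counterfactual run into the clean run, and measure how much the output changes. The following is a minimal sketch of that idea on a made-up three-neuron model. The neurons, the weights, and the function names (`hidden`, `readout`, `patching_effects`) are all illustrative assumptions, not the authors' code or any real model.

```python
from typing import List, Tuple

# Hypothetical readout weights from three toy neurons to the correct-answer logit.
WEIGHTS = [2.0, 1.0, 3.0]


def hidden(a: int, b: int) -> List[float]:
    """Toy neuron activations for the prompt 'a+b=' (all three are invented)."""
    return [
        1.0 if 20 <= a <= 29 else 0.0,                  # neuron 0: op1 in [20, 29]
        1.0 if a % 10 == 3 and b % 10 == 4 else 0.0,    # neuron 1: unit digits 3 and 4
        0.5,                                            # neuron 2: constant, ignores input
    ]


def readout(acts: List[float]) -> float:
    """Linear readout standing in for the logit of the clean answer."""
    return sum(w * x for w, x in zip(WEIGHTS, acts))


def patching_effects(clean: Tuple[int, int],
                     counterfactual: Tuple[int, int]) -> List[float]:
    """Drop in the clean-answer logit when each neuron is patched in turn."""
    clean_acts = hidden(*clean)
    cf_acts = hidden(*counterfactual)
    base = readout(clean_acts)
    effects = []
    for i in range(len(clean_acts)):
        patched = list(clean_acts)
        patched[i] = cf_acts[i]  # swap in the counterfactual activation
        effects.append(base - readout(patched))
    return effects


effects = patching_effects(clean=(23, 14), counterfactual=(61, 35))
print(effects)
```

In this toy, neuron 2 has the largest weight but zero patching effect, because its activation does not depend on the prompt. That is the reason a causal intervention, rather than weight or activation magnitude alone, is used to decide which neurons matter.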

Findings

01. LLMs rely on heuristic neurons for arithmetic reasoning.

02. A sparse set of neurons implements simple input-pattern heuristics.

03. The heuristic mechanism is prominent early in training.

Abstract

Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a representative task. Using causal analysis, we identify a subset of the model (a circuit) that explains most of the model's behavior for basic arithmetic logic and examine its functionality. By zooming in on the level of individual circuit neurons, we discover a sparse set of important neurons that implement simple heuristics. Each heuristic identifies a numerical input pattern and outputs corresponding answers. We hypothesize that the combination of these heuristic neurons is the mechanism used to produce correct arithmetic answers. To test this, we categorize each neuron into several heuristic types, such as neurons that activate when an operand falls within a certain range, and find that the…
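To make the "bag of heuristics" idea concrete, here is a minimal sketch in which each heuristic detects an input pattern and boosts a set of candidate answers, and the prediction is whichever answers collect the most boosts. The three heuristics, their ranges, and the names `Heuristic`, `HEURISTICS`, and `predict` are hypothetical illustrations chosen for this example. They are not neurons reported in the paper, and the vote-counting rule is an assumption standing in for how a model would sum contributions to its output logits.

```python
from typing import Callable, List, NamedTuple


class Heuristic(NamedTuple):
    name: str
    fires: Callable[[int, int], bool]  # detects a pattern in the operands
    boosts: Callable[[int], bool]      # which candidate answers get a boost


CANDIDATES = range(200)  # possible answers for two operands below 100

# Three invented heuristics for addition prompts of the form "a+b=".
HEURISTICS = [
    Heuristic("op1 in [20,29], op2 in [10,19] -> answer in [30,48]",
              lambda a, b: 20 <= a <= 29 and 10 <= b <= 19,
              lambda r: 30 <= r <= 48),
    Heuristic("op1 in [20,24], op2 in [10,14] -> answer in [30,38]",
              lambda a, b: 20 <= a <= 24 and 10 <= b <= 14,
              lambda r: 30 <= r <= 38),
    Heuristic("unit digits 3 and 4 -> answer ends in 7",
              lambda a, b: a % 10 == 3 and b % 10 == 4,
              lambda r: r % 10 == 7),
]


def predict(a: int, b: int) -> List[int]:
    """Return every candidate answer that received the most boosts."""
    scores = [0] * len(CANDIDATES)
    for h in HEURISTICS:
        if h.fires(a, b):
            for r in CANDIDATES:
                if h.boosts(r):
                    scores[r] += 1
    best = max(scores)
    if best == 0:
        return []  # no heuristic recognised this prompt
    return [r for r in CANDIDATES if scores[r] == best]


print(predict(23, 14))       # all three heuristics fire and agree on one answer
print(len(predict(26, 17)))  # only the coarse range heuristic fires
print(predict(50, 50))       # nothing fires
```

For `23+14` the three overlapping patterns single out 37. For `26+17` only the coarse range heuristic fires, so the correct answer 43 is tied with 18 other candidates, and for `50+50` no heuristic fires at all. The sketch shows why coverage by many simple patterns can give correct answers on some prompts without amounting to a general addition algorithm.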

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

1. Arithmetic tasks are a particularly useful test bed to understand the learning mechanism of LLMs. However, it has been an open question as to what algorithm the model is truly learning. This work makes an important step in this direction, particularly since it applies to the newest edition of LLMs.
2. The bag of heuristics finding explains the lack of length generalization in most LLMs on arithmetic tasks.
3. The experiments and analysis are extensive and the authors make an effort to validat

Weaknesses

1. A lot of the analysis assumes that all tokenizers tokenize numbers as a single token up to some limit. Interestingly, I found this to be true for llama3 and pythia, but I don't think it is always true. In fact, the llama2 tokenizer was different in that it tokenized each digit individually. I would like to see whether these findings are tokenizer-specific.
2. The sampling mechanism of activation patching seems to be fairly "high-variance". Sampling a "random counterfactual promp

Reviewer 02·Rating 6·Confidence 2

Strengths

This paper applies causal analysis and activation patching experiments to identify individual neuron behaviors. The authors' step-by-step examination, from identifying neurons and classifying them into heuristic types to analyzing their evolution over training time, offers a thorough perspective on LLMs' arithmetic reasoning mechanisms. The paper may facilitate future studies on generalization abilities.

Weaknesses

This paper focuses on arithmetic calculation but mentions general reasoning at the beginning. I feel that this could be an overclaim, because simple one-step calculation is far from reasoning. The authors draw all conclusions from pre-trained checkpoints. Since I think all models need fine-tuning before use, it would be better to have a similar analysis on arithmetically or generally fine-tuned LLMs. The focus is primarily on arithmetic reasoning, which may not fully generalize to other c

Reviewer 03·Rating 8·Confidence 3

Strengths

- For the most part the paper was well-written and well motivated. At any given point in the paper, it was easy to understand (1) what the hypothesis being tested is, (2) why the authors are testing this hypothesis, (3) the experimental setup, and (4) the results.
- The experiments conducted by the authors are thorough and well-support the claims made throughout the paper.
- The topic of the paper is timely and illuminates an interesting aspect of arithmetic reasoning in language models that has

Weaknesses

- The title of the paper is slightly misleading: “Models Solve Math with a Bag of Heuristics”. In reality, the authors only examine arithmetic where both operands have fewer than three digits. Maybe for a greater number of digits, these heuristics are somehow algorithmically combined? This question is left open in the paper (which is fine, I think the paper is a good contribution as is), but the title does not reflect this.
- At the very least, I think this point should be mentioned briefly in th

Code & Models

Repositories

technion-cs-nlp/llm-arithmetic-heuristics
jax·Official

Taxonomy

Topics·Computability, Logic, AI Algorithms

Methods·Sparse Evolutionary Training