Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, Keith Ross

TL;DR
This paper shows that mathematical reasoning in large language models depends on a small set of layers that are established during pre-training and remain important after post-training. Removing these layers sharply reduces math accuracy, while factual recall is affected much less.
Contribution
Using layer-wise ablation on base and post-trained models, it demonstrates that the layers critical for math reasoning are established during pre-training and remain stable through instruction tuning, reinforcement learning, and knowledge distillation.
Findings
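Critical layers for math reasoning are stable across post-training methods (instruction tuning, reinforcement learning, knowledge distillation).
Removing these layers reduces math accuracy by as much as 80%, while factual recall tasks show smaller drops.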
Tokens drift towards more task-relevant representations near these layers.
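The layer-removal finding can be illustrated with a toy sketch. The summary does not say how the paper ablates a layer, so this assumes the simplest variant: skipping one residual block so the hidden state passes through unchanged. The names `run_model`, `accuracy`, and `layer_importance`, and the three-layer toy model, are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Layer = Callable[[float], float]


def run_model(layers: Sequence[Layer], x: float, skip: Optional[int] = None) -> float:
    """Toy residual stack: h <- h + f(h) for each layer, except the skipped one."""
    h = x
    for i, f in enumerate(layers):
        if i == skip:
            continue  # ablated layer: the residual stream passes through unchanged
        h = h + f(h)
    return h


def accuracy(layers: Sequence[Layer], data: Sequence[Tuple[float, int]],
             skip: Optional[int] = None) -> float:
    """Fraction of examples whose rounded output matches the target."""
    correct = sum(1 for x, y in data if round(run_model(layers, x, skip)) == y)
    return correct / len(data)


def layer_importance(layers: Sequence[Layer],
                     data: Sequence[Tuple[float, int]]) -> List[float]:
    """Accuracy drop caused by removing each layer in turn."""
    base = accuracy(layers, data)
    return [base - accuracy(layers, data, skip=i) for i in range(len(layers))]


# Hypothetical toy task (y = 2x): only the middle layer does the useful work.
toy_layers: List[Layer] = [lambda h: 0.0, lambda h: h, lambda h: 0.0]
toy_data = [(1, 2), (2, 4), (3, 6)]
drops = layer_importance(toy_layers, toy_data)
```

In this toy setup only the middle layer carries the computation, so it is the only one whose removal changes accuracy. With a real transformer the same loop would run over decoder blocks and a math benchmark, and would be repeated on the base model and each post-trained variant to compare the resulting importance profiles.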
Abstract
Large language models improve at math after instruction tuning, reinforcement learning, or knowledge distillation. We ask whether these gains come from major changes in the transformer layers or from smaller adjustments that keep the original structure. Using layer-wise ablation on base and trained variants, we find that math reasoning depends on a few critical layers, which stay important across all post-training methods. Removing these layers reduces math accuracy by as much as 80%, whereas factual recall tasks only show relatively smaller drops. This suggests that specialized layers for mathematical tasks form during pre-training and remain stable afterward. As measured by Normalized Mutual Information (NMI), we find that near these critical layers, tokens drift from their original syntactic clusters toward representations aligned with tokens less syntactically related but…
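For reference, Normalized Mutual Information between two labelings of the same tokens (for example, cluster assignments versus syntactic categories) can be computed as in the sketch below. It assumes normalization by the arithmetic mean of the two entropies, which is one common convention. The abstract does not state which normalization the paper uses, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI = I(A; B) / mean(H(A), H(B)); the normalization is an assumption."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Mutual information summed over label pairs that actually co-occur.
    mi = sum(
        (c / n) * math.log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    # Convention chosen here: two constant labelings count as identical.
    return mi / denom if denom > 0 else 1.0


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # same grouping, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # statistically independent
```

Under this reading, a falling NMI between token clusters and syntactic categories near a layer would indicate that representations are moving away from their syntactic grouping.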
Taxonomy
Topics: Cognitive and developmental aspects of mathematical skills · Text Readability and Simplification · Topic Modeling
Methods: Balanced Selection
