Progress measures for grokking via mechanistic interpretability
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

TL;DR
This paper investigates grokking in neural networks. It uses mechanistic interpretability to reverse engineer the learned algorithm, defines progress measures from that algorithm, and shows that training proceeds as a gradual process rather than a sudden shift.
Contribution
The paper applies mechanistic interpretability to grokking, reverse engineers the Fourier-based algorithm the network learns, and defines progress measures that track the phases of training.
Findings
Grokking arises from gradual amplification of structured mechanisms.
The learned algorithm uses Fourier transforms and trigonometric identities.
Training dynamics can be split into memorization, circuit formation, and cleanup phases.
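The Fourier-and-trigonometric-identity finding above can be illustrated with a minimal hand-written sketch. It is not the trained network's weights or the authors' code, only the idea of turning modular addition into rotation about a circle. The function name, the modulus p = 113, and the frequencies (1, 2, 5) are assumptions chosen for illustration.

```python
import math

def modular_add_via_rotation(a, b, p=113, freqs=(1, 2, 5)):
    """Return (a + b) % p using only sines and cosines of a, b, and candidates c.

    Hypothetical illustration: p and freqs are assumed values, not taken
    from the summary above.
    """
    rotated = []
    for k in freqs:
        w = 2 * math.pi * k / p  # angular frequency for this Fourier component
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities: compose the two rotations to get w * (a + b).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        rotated.append((w, cos_ab, sin_ab))

    # Score each candidate c by sum_k cos(w_k * (a + b - c)), expanded with the
    # angle-difference identity. Every term equals 1 only when c == (a + b) mod p.
    logits = [
        sum(cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
            for w, cos_ab, sin_ab in rotated)
        for c in range(p)
    ]
    return max(range(p), key=logits.__getitem__)
```

Under these assumptions, modular_add_via_rotation(100, 50) returns 37, which matches (100 + 50) % 113. Summing over several frequencies widens the gap between the correct answer and its neighbours, since a single low frequency separates adjacent candidates only slightly.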
Abstract
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations…
Taxonomy
Topics: Neural Networks and Applications · Neural dynamics and brain function · Advanced Memory and Neural Computing
