Progress measures for grokking via mechanistic interpretability

Neel Nanda; Lawrence Chan; Tom Lieberum; Jess Smith; Jacob Steinhardt

arXiv:2301.05217·cs.LG·October 23, 2023·54 cites

TL;DR

This paper investigates the phenomenon of grokking in neural networks, using mechanistic interpretability to reverse engineer learned algorithms, define progress measures, and analyze the training dynamics as a gradual process rather than a sudden shift.

Contribution

It introduces a mechanistic interpretability approach to understand grokking, reverse engineers the learned Fourier-based algorithm, and defines progress measures to analyze training phases.

Findings

01. Grokking arises from gradual amplification of structured mechanisms.

02. The learned algorithm uses Fourier transforms and trigonometric identities.

03. Training dynamics can be split into memorization, circuit formation, and cleanup phases.

Abstract

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations…
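
To make "convert addition to rotation about a circle" concrete, here is a minimal sketch of a Fourier-style modular addition routine. It is an illustration of the idea described in the abstract, not the trained network's weights or the authors' code. The function name `fourier_mod_add`, the modulus `p=113`, and the frequencies `(1, 2, 3)` are assumptions chosen for the example.

```python
import math

def fourier_mod_add(a, b, p=113, freqs=(1, 2, 3)):
    """Compute (a + b) mod p using only sines, cosines, and trig identities.

    p and freqs are illustrative assumptions; p should be prime and each
    frequency should lie in 1..p-1 so that the maximum below is unique.
    """
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Represent each input as a point on the unit circle (one rotation per input).
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities compose the two rotations: angle w*(a+b).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        # cos(w*(a+b-c)) = cos(w*(a+b))cos(w*c) + sin(w*(a+b))sin(w*c),
        # which equals 1 exactly when c is congruent to a+b modulo p.
        for c in range(p):
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The candidate c with the largest summed score is the predicted answer.
    return max(range(p), key=logits.__getitem__)
```

Each frequency contributes a score that peaks at the correct residue, so summing a few frequencies yields a logit vector whose argmax is the sum modulo `p`. A single frequency already suffices in exact arithmetic; using several simply widens the margin between the correct answer and the runners-up.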

Code & Models

Repositories

mechanistic-interpretability-grokking/progress-measures-paper
PyTorch · Official

Models

🤗
BurnyCoder/grokking-modular-addition-transformer
model

Videos

Progress measures for grokking via mechanistic interpretability · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Neural dynamics and brain function · Advanced Memory and Neural Computing