Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

TL;DR
This paper shows that grokking, usually studied in neural networks, also occurs in a non-neural model: Recursive Feature Machines (RFM), which learn features through the Average Gradient Outer Product (AGOP). The result points to feature learning as the core mechanism behind grokking.
Contribution
The work shows that grokking is specific neither to neural networks nor to gradient descent-based optimization, and presents RFM with AGOP as a general framework for understanding emergent behavior through feature learning.
Findings
Grokking occurs in RFM with AGOP, not just neural networks.
The transition to perfect test accuracy is driven by feature learning and is not predicted by loss metrics.
RFM learns block-circulant features implementing Fourier multiplication.
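To illustrate what "Fourier multiplication" means for modular addition, here is a minimal sketch, not the paper's code. It relies on a standard fact: the circular convolution of the one-hot vectors for a and b is the one-hot vector for (a + b) mod p, and that convolution becomes an elementwise product in the Fourier domain. The helper names `one_hot` and `add_mod_p_fourier` are hypothetical.

```python
import numpy as np


def one_hot(k, p):
    """Return the length-p one-hot vector with a 1 at index k mod p."""
    v = np.zeros(p)
    v[k % p] = 1.0
    return v


def add_mod_p_fourier(a, b, p):
    """Compute (a + b) mod p by multiplying DFTs of one-hot encodings.

    Illustrative only: the elementwise product in the Fourier domain equals
    the circular convolution of the two one-hot vectors, which is the
    one-hot vector at index (a + b) mod p.
    """
    fa = np.fft.fft(one_hot(a, p))
    fb = np.fft.fft(one_hot(b, p))
    conv = np.fft.ifft(fa * fb).real  # circular convolution of the inputs
    return int(np.argmax(conv))
```

Circulant matrices are exactly the matrices that act as circular convolutions, which is why circulant structure in learned features is naturally associated with this kind of Fourier-domain computation.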
Abstract
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This…
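The RFM loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel with a learned Mahalanobis matrix M, kernel ridge regression as the predictor, and no rescaling of M between iterations. The function names and the parameters `gamma`, `reg`, and `iters` are hypothetical choices.

```python
import numpy as np


def gaussian_kernel(X, Z, M, gamma):
    """K[i, j] = exp(-gamma * (x_i - z_j)^T M (x_i - z_j)), M assumed symmetric."""
    XM = X @ M
    d2 = (
        np.sum(XM * X, axis=1)[:, None]
        - 2.0 * XM @ Z.T
        + np.sum((Z @ M) * Z, axis=1)[None, :]
    )
    return np.exp(-gamma * np.maximum(d2, 0.0))


def agop(X, alpha, M, gamma):
    """Average Gradient Outer Product of f(x) = sum_i alpha_i K_M(x, x_i).

    Averages J(x_j)^T J(x_j) over training points, where J(x_j) is the
    (outputs x inputs) Jacobian of the predictor at x_j.
    """
    n, d = X.shape
    K = gaussian_kernel(X, X, M, gamma)
    G = np.zeros((d, d))
    for j in range(n):
        diff = X[j] - X  # row i holds x_j - x_i
        # Gradient of K_M(x, x_i) at x = x_j is -2 * gamma * K[j, i] * M (x_j - x_i).
        grad_k = -2.0 * gamma * K[j][:, None] * (diff @ M)
        J = alpha.T @ grad_k  # shape (outputs, inputs)
        G += J.T @ J
    return G / n


def rfm_feature_matrix(X, Y, iters=3, gamma=0.5, reg=1e-3):
    """Alternate between fitting a kernel machine and updating M with the AGOP."""
    n, d = X.shape
    M = np.eye(d)  # start from the plain (isotropic) kernel
    for _ in range(iters):
        K = gaussian_kernel(X, X, M, gamma)
        alpha = np.linalg.solve(K + reg * np.eye(n), Y)  # kernel ridge fit
        M = agop(X, alpha, M, gamma)  # feature-learning step
    return M
```

Here `X` is an (n, d) array of inputs and `Y` an (n, c) array of targets; the returned M is a d-by-d symmetric positive semidefinite matrix that reweights input directions according to how strongly the fitted predictor varies along them.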
Taxonomy
Topics: Neural Networks and Applications
