Grokking as a First Order Phase Transition in Two Layer Networks
Noa Rubin, Inbar Seroussi, Zohar Ringel

TL;DR
This paper models the Grokking phenomenon in deep neural networks as a first-order phase transition, providing analytical insights and linking feature learning to phase transition theory.
Contribution
It introduces a theoretical framework connecting Grokking to phase transitions using the adaptive kernel approach in teacher-student models.
Findings
Grokking corresponds to a first-order phase transition.
Post-Grokking, the network's internal representations are sharply distinct.
Analytical predictions match observed Grokking behavior.
Abstract
A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN…
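To make the two teacher-student setups named in the abstract concrete, here is a minimal sketch of how such training data might be generated. The function names, the one-hot input encoding, the coefficient `eps`, and the specific cubic form (a linear term plus the third Hermite polynomial of the teacher overlap) are illustrative assumptions, not details confirmed by this summary; the paper's actual targets and scalings may differ.

```python
import random


def modular_addition_dataset(p):
    """All p*p pairs (a, b) with label (a + b) mod p.

    Inputs are one-hot encoded as a vector of length 2p (an assumed encoding).
    """
    data = []
    for a in range(p):
        for b in range(p):
            x = [0.0] * (2 * p)
            x[a] = 1.0       # one-hot for a in the first p slots
            x[p + b] = 1.0   # one-hot for b in the last p slots
            data.append((x, (a + b) % p))
    return data


def cubic_teacher(x, w_star, eps=0.3):
    """Hypothetical cubic teacher acting on the overlap u = w_star . x.

    Returns u + eps * He_3(u), with He_3(u) = u^3 - 3u.
    """
    u = sum(wi * xi for wi, xi in zip(w_star, x))
    return u + eps * (u ** 3 - 3.0 * u)


def sample_cubic_dataset(n, d, seed=0):
    """Draw n Gaussian inputs in d dimensions and label them with the teacher."""
    rng = random.Random(seed)
    w_star = [1.0] + [0.0] * (d - 1)  # unit-norm teacher direction (assumed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [cubic_teacher(x, w_star) for x in xs]
    return xs, ys, w_star
```

In this sketch, a student network would be trained on `(xs, ys)` or on the modular-addition pairs, and feature learning would show up as a growing overlap between the student's weights and the teacher structure.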
Peer Reviews
Decision: ICLR 2024 poster
* This paper seems to have made some efforts to explain grokking, though I cannot fully understand them.
* This paper is poorly written. It claims to use an "adaptive kernel approach" to explain grokking, but it never explains what this method is. I tried to read the previous works, but they are not easy to read for an ML audience, either. I urge the authors to introduce the background better: What is the "adaptive kernel"? Why is it important to study "action"? Why does the approximation in the paragraph beginning with "Next," on Page 4 make sense? What is the theory of phase transition i
- The formal treatment of Grokking using the approach proposed appears to be novel and compelling.
- Understanding the phenomenon through the lens of feature vs. lazy learning and casting it in the language of phase transitions is a direction of research that many prior works speculated about. This paper presents much interesting work in this direction.
- The connection to Grokking, as observed in prior work across various tasks and modalities, seems almost secondary in this paper. This work focuses on some toy models, solves them, and refers very little to the general phenomenon of delayed generalization. The general feeling I have from this paper is that the approach somewhat obscures the contributions and implications. The formalism developed here and in the referenced work seems compelling, yet the paper does not go beyond two simple toy m
This paper extends a promising approach to neural network theory, known as the adaptive kernel approach, which studies how the kernels of deep networks adapt to data after feature learning. This paper provides two interesting case studies (polynomial regression and modular arithmetic) where they make progress on deriving an effective action which depends only on overlaps with the teacher direction $w^*$ or $v$. They show that the derived theory is accurate in simulations of networks on these lea
While the phenomena described and the resulting theoretical picture of 3 phases is quite impressive, I am not sure that this transition constitutes grokking as it is usually understood where training loss decreases much earlier than test loss during gradient based learning dynamics. I do not see this as a fundamental limitation of the paper (which I quite appreciate) but mainly as an issue of framing. In my opinion this work is a more fundamental phenomenon than grokking since it pertains to fu
Taxonomy
Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Statistical Mechanics and Entropy
