Grokking as a First Order Phase Transition in Two Layer Networks

Noa Rubin; Inbar Seroussi; Zohar Ringel

arXiv:2310.03789 · stat.ML · May 7, 2024 · 1 citation

TL;DR

This paper models the Grokking phenomenon in deep neural networks as a first-order phase transition, providing analytical insights and linking feature learning to phase transition theory.

Contribution

It introduces a theoretical framework connecting Grokking to phase transitions using the adaptive kernel approach in teacher-student models.

Findings

01

Grokking corresponds to a first-order phase transition.

02

Post-Grokking, the network's internal representations are sharply distinct.

03

Analytical predictions match observed Grokking behavior.

Abstract

A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

* This paper seems to have made some efforts to explain grokking, though I cannot fully understand them.

Weaknesses

* This paper is poorly written. It claims to use an "adaptive kernel approach" to explain grokking, but it never explains what this method is. I tried to read the previous works, but they are not easy for an ML audience to read, either. I urge the authors to introduce the background better: What is the "adaptive kernel"? Why is it important to study "action"? Why does the approximation in the paragraph beginning with "Next," in Page 4 make sense? What is the theory of phase transition i

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The formal treatment of Grokking using the approach proposed appears to be novel and compelling.
- Understanding the phenomenon through the lens of feature vs. lazy learning and casting it in the language of phase transitions is a direction of research that many prior works speculated about. This paper presents much interesting work in this direction.

Weaknesses

- The connection to Grokking, as observed in prior work across various tasks and modalities, seems almost secondary in this paper. This work focuses on some toy models, solves them, and refers very little to the general phenomenon of delayed generalization. The general feeling I have from this paper is that the approach somewhat obscures the contributions and implications. The formalism developed here and in the referenced work seems compelling, yet the paper does not go beyond two simple toy m

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

This paper extends a promising approach to neural network theory, known as the adaptive kernel approach, which studies how the kernels of deep networks adapt to data after feature learning. This paper provides two interesting case studies (polynomial regression and modular arithmetic) where they make progress on deriving an effective action which depends only on overlaps with the teacher direction $w^*$ or $v$. They show that the derived theory is accurate in simulations of networks on these lea

Weaknesses

While the phenomena described and the resulting theoretical picture of 3 phases is quite impressive, I am not sure that this transition constitutes grokking as it is usually understood where training loss decreases much earlier than test loss during gradient based learning dynamics. I do not see this as a fundamental limitation of the paper (which I quite appreciate) but mainly as an issue of framing. In my opinion this work is a more fundamental phenomenon than grokking since it pertains to fu

Videos

Grokking as a First Order Phase Transition in Two Layer Networks · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Statistical Mechanics and Entropy