Grokking as the Transition from Lazy to Rich Training Dynamics

Tanishq Kumar; Blake Bordelon; Samuel J. Gershman; Cengiz Pehlevan

arXiv:2310.06110 · stat.ML · April 12, 2024

Open Access · 3 Reviews

TL;DR

This paper explains the grokking phenomenon as a transition from lazy, kernel-like training to rich feature learning in neural networks, and shows that how long generalization is delayed depends on how well the network's initial features align with the target and on dataset size.

Contribution

It introduces a new mechanism for grokking based on the transition from lazy training to feature learning, supported by analysis of a polynomial regression model and by experiments on various architectures.

Findings

01

Grokking results from a shift from kernel to feature learning.

02

Delayed generalization depends on initial feature alignment and dataset size.

03

Transition from lazy to rich training controls grokking in neural networks.
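
For readers unfamiliar with the terminology, the "lazy" (kernel) regime is commonly formalized by linearizing the network around its initialization. The notation below ($\theta$, $\theta_0$, $K$) is the standard one from the kernel-regime literature and is an assumption here, not something taken from this page:

$$
f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),
\qquad
K(x,x') = \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0).
$$

While this approximation holds, training behaves like kernel regression with the fixed kernel $K$ built from the initial features. "Rich" training refers to the regime where $\theta$ moves far enough that the features, and hence the kernel, change.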

Abstract

We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We find that the key determinants of grokking…
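
The setting described in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the activation `phi`, the output scale `alpha`, the single-index quadratic target, the step-size scaling, and all hyperparameter values are illustrative choices that the text on this page does not confirm. It runs full-batch gradient descent without regularization and records train and test loss at every step, so the gap between the two curves can be inspected.

```python
import numpy as np

def phi(h, eps):
    # Assumed activation: linear term plus a small quadratic term.
    return h + 0.5 * eps * h**2

def predict(W, X, alpha, eps):
    # Two-layer network with fixed, uniform readout: f(x) = alpha * mean_i phi(w_i . x)
    return alpha * phi(X @ W.T, eps).mean(axis=1)

def mse(W, X, y, alpha, eps):
    r = predict(W, X, alpha, eps) - y
    return 0.5 * np.mean(r**2)

def grad(W, X, y, alpha, eps):
    # Analytic gradient of mse with respect to the hidden weights W (shape N x D).
    H = X @ W.T                                   # pre-activations, shape (P, N)
    r = alpha * phi(H, eps).mean(axis=1) - y      # residuals, shape (P,)
    dphi = 1.0 + eps * H                          # phi'(h), shape (P, N)
    N, P = W.shape[0], X.shape[0]
    return (alpha / N) * (dphi * r[:, None]).T @ X / P

def train(D=10, N=50, P=100, P_test=1000, alpha=1.0, eps=0.5,
          lr0=0.02, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D)
    beta /= np.linalg.norm(beta)                  # unit-norm target direction
    X = rng.standard_normal((P, D))               # isotropic Gaussian inputs
    X_test = rng.standard_normal((P_test, D))
    y = 0.5 * (X @ beta) ** 2                     # assumed single-index quadratic target
    y_test = 0.5 * (X_test @ beta) ** 2
    W = rng.standard_normal((N, D)) / np.sqrt(D)  # hidden weights at initialization
    lr = lr0 * N / alpha**2                       # assumed step-size scaling
    history = []
    for _ in range(steps):
        history.append((mse(W, X, y, alpha, eps),
                        mse(W, X_test, y_test, alpha, eps)))
        W = W - lr * grad(W, X, y, alpha, eps)    # vanilla gradient descent step
    history.append((mse(W, X, y, alpha, eps),
                    mse(W, X_test, y_test, alpha, eps)))
    return np.array(history)                      # columns: train loss, test loss

if __name__ == "__main__":
    hist = train()
    print("initial train/test loss:", hist[0])
    print("final   train/test loss:", hist[-1])
```

Plotting the two columns of the returned array on a log-scaled step axis is a simple way to look for a delay between the drop in train loss and the drop in test loss; whether a delay appears will depend on the assumed hyperparameters.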

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The proposed toy model is interesting and relevant to the ICLR community. Most existing works on grokking considered modular arithmetic tasks, or measured the classification error rather than the surrogate loss used in training. In contrast, the authors studied the regression setting where grokking is manifested in the $L_2$ error, and the training procedure does not involve $\ell_2$ regularization. Also, the connection between grokking and the transition from lazy to rich regime is to my knowle

Weaknesses

I have the following concerns. 1. Given the idealized setting (Gaussian data with identity covariance, single-index target), it is rather underwhelming that the authors did not provide any quantitative characterization of the training dynamics to prove the existence of grokking. Instead, the proposed explanation is only verified empirically, which limits the contribution. Can the authors comment on the technical challenges in analyzing the gradient flow trajectory for this quadratic model? 2

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

This paper studies the grokking phenomenon in deep learning, which is a recent hot topic and very relevant to ICLR. This paper proposes that grokking can be triggered by the transition from the kernel regime to the feature learning regime. Though this was already known even before the grokking paper by Power et al., 2022, e.g., the example of the quadratically overparametrized linear model in Section 6 of Li et al., 2021, the novelty here is that this paper focuses on vanilla GD. In contrast, the transition b

Weaknesses

1. The definition of grokking seems to be very different from that in the literature. In Power et al., 2022, **grokking** refers to the phenomenon that "long after severely overfitting, validation accuracy sometimes suddenly begins to increase from chance level toward perfect generalization", whereas this paper describes grokking in its introduction as "train loss of a neural network decreases much earlier than its test loss". It is ok to me that this paper only focuses on the regression setting and

Reviewer 03Rating 8· accept, good paperConfidence 3

Strengths

I think the paper is clean and the point being made is sufficiently important. Also, I find the experiments sufficiently convincing. Crucially, it advances our understanding of the grokking phenomenon. The shortness of the review only reflects the fact that I do not have much criticism for this work.

Weaknesses

Nothing major.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Quantum many-body systems