To Grok Grokking: Provable Grokking in Ridge Regression
Mingyue Xu, Gal Vardi, Itay Safran

TL;DR
This paper provides a rigorous theoretical analysis of grokking in ridge regression, demonstrating how generalization improves after overfitting and how hyperparameters influence this process, with implications for deep learning.
Contribution
It gives what the authors describe as the first rigorous quantitative bounds on grokking time in ridge regression, stated in terms of training hyperparameters, and extends the insights to non-linear neural networks through empirical validation.
Findings
Grokking involves a delay between overfitting and generalization.
Proper hyperparameter tuning can amplify or eliminate grokking.
Theoretical bounds on grokking time match empirical observations.
Abstract
We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear…
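The training setup the abstract describes (gradient descent with weight decay on an over-parameterized linear model) can be sketched in a few lines of NumPy. This is a toy illustration under assumptions that are not taken from the paper: the function name `run_ridge_gd`, the isotropic Gaussian inputs, the one-hot target `w_star`, the large random initialization, and every hyperparameter value are hypothetical choices. In this sketch the training loss collapses within a few hundred steps, while the test loss stays large until weight decay has shrunk the part of the weights that the training data cannot see. The test loss then plateaus at the error of the near-minimum-norm solution rather than going to zero, so the sketch shows only the delay, not the paper's third stage or its bounds.

```python
import numpy as np


def run_ridge_gd(n=20, d=200, lr=0.05, weight_decay=0.01,
                 init_scale=5.0, steps=20000, log_every=100, seed=0):
    """Gradient descent with weight decay on a toy linear regression (n < d).

    All defaults are illustrative assumptions, not values from the paper.
    Returns a list of (step, train_loss, test_loss) tuples.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: a one-hot weight vector, noiseless labels.
    w_star = np.zeros(d)
    w_star[0] = 1.0
    X = rng.standard_normal((n, d))
    y = X @ w_star

    # Large random initialization; most of it lies in directions that the
    # n training points do not constrain.
    w = init_scale * rng.standard_normal(d)

    history = []
    for t in range(steps):
        residual = X @ w - y
        if t % log_every == 0:
            train_loss = float(np.mean(residual ** 2))
            # For standard normal test inputs, the expected squared error
            # equals the squared distance to w_star, so no test set is needed.
            test_loss = float(np.sum((w - w_star) ** 2))
            history.append((t, train_loss, test_loss))
        # Gradient of 0.5 * mean(residual^2) + 0.5 * weight_decay * ||w||^2.
        grad = X.T @ residual / n + weight_decay * w
        w -= lr * grad
    return history


if __name__ == "__main__":
    for step, train_loss, test_loss in run_ridge_gd()[::20]:
        print(f"step {step:6d}  train {train_loss:.3e}  test {test_loss:.3e}")
```

In this sketch the unconstrained directions shrink by a factor of `1 - lr * weight_decay` per step, so a smaller `weight_decay` or a larger `init_scale` lengthens the gap between fitting and generalizing, and a near-zero initialization removes it. That is a property of this toy model only; the paper's bounds should be consulted for the actual dependence on hyperparameters.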
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Face and Expression Recognition
