Grokking Explained: A Statistical Phenomenon

Breno W. Carvalho, Artur S. d'Avila Garcez, Luís C. Lamb, and Emílio Vital Brazil

arXiv:2502.01774·cs.LG·February 5, 2025

Open Access

TL;DR

This paper investigates grokking in deep learning, arguing that a distribution shift between training and test data is a key factor in its emergence, and introduces two synthetic datasets to analyze its causes and mechanisms.

Contribution

The paper formalizes grokking, relates it to distribution shifts between training and test data, and shows that it can occur with dense data and minimal hyper-parameter tuning.

Findings

01. Grokking is linked to distribution shifts between training and test data.

02. Small-sampling facilitates grokking but is not its primary cause.

03. Grokking can occur with dense data and minimal hyper-parameter tuning.

Abstract

Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates transfer learning's role in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small-sampling is strongly associated with grokking, it is not its cause. Instead, small-sampling serves as a convenient…
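To make the idea of inducing a distribution shift through imbalanced sampling of sub-categories more concrete, here is a minimal Python sketch. It is not the paper's datasets or code: the four sub-category centers, the class layout, the sampling weights, and the helper names `sample_subcategories` and `make_split` are all assumptions chosen for illustration. The sketch draws a training set in which two sub-categories are heavily under-sampled and a test set in which all sub-categories are equally likely, then measures the resulting shift as the total variation distance between the two sub-category distributions.

```python
import numpy as np

def sample_subcategories(n, weights, rng):
    """Draw n sub-category ids with the given (unnormalized) probabilities."""
    weights = np.asarray(weights, dtype=float)
    return rng.choice(len(weights), size=n, p=weights / weights.sum())

def make_split(n, weights, centers, rng, noise=0.1):
    """Sample points around sub-category centers; the label is the parent class."""
    sub = sample_subcategories(n, weights, rng)
    x = centers[sub] + noise * rng.standard_normal((n, centers.shape[1]))
    # Hypothetical layout: sub-categories 0 and 1 form class 0; 2 and 3 form class 1.
    y = sub // 2
    return x, y, sub

rng = np.random.default_rng(0)
# Assumed 2-D centers for the four sub-categories (illustrative only).
centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

# Training set: sub-categories 1 and 3 are heavily under-sampled.
x_train, y_train, sub_train = make_split(2000, [0.48, 0.02, 0.48, 0.02], centers, rng)
# Test set: all sub-categories are equally likely.
x_test, y_test, sub_test = make_split(2000, [0.25, 0.25, 0.25, 0.25], centers, rng)

train_freq = np.bincount(sub_train, minlength=4) / len(sub_train)
test_freq = np.bincount(sub_test, minlength=4) / len(sub_test)

# Total variation distance between the sub-category distributions of the two splits.
tv_shift = 0.5 * np.abs(train_freq - test_freq).sum()
print("train sub-category frequencies:", train_freq)
print("test sub-category frequencies: ", test_freq)
print("total variation shift:", tv_shift)
```

In this sketch the class labels stay roughly balanced in both splits, so the shift lives entirely at the sub-category level. Varying the training weights gives a simple knob for controlling how far the training distribution departs from the test distribution.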

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Sparse Evolutionary Training