Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra

TL;DR
This paper studies 'grokking' on small, algorithmically generated datasets: neural networks can move from chance-level to perfect generalization well after they have overfit the training data, and smaller datasets require more optimization before this happens.
Contribution
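The study introduces 'grokking', a phenomenon in which a neural network's generalization improves from chance level to perfect long after it has overfit the training set, and proposes small algorithmic datasets as a setting for studying data efficiency, memorization, and speed of learning in detail.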
Findings
In some settings, neural networks 'grok' a pattern in the data, moving from chance-level to perfect generalization.
This improvement can occur well past the point at which the network has overfit the training set.
Smaller datasets require increasing amounts of optimization before the network generalizes.
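One way to make "well past the point of overfitting" concrete is to measure the gap between the checkpoint at which training accuracy first reaches a threshold and the checkpoint at which validation accuracy does. The sketch below is illustrative only: the function names, the 0.99 threshold, and the example curves are assumptions made for this summary, not values or code from the paper.

```python
def first_step_reaching(accuracies, threshold):
    """Return the index of the first checkpoint with accuracy >= threshold, or None."""
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None


def generalization_delay(train_acc, val_acc, threshold=0.99):
    """Number of checkpoints between fitting the training set and generalizing.

    Returns None if either curve never reaches the threshold.
    """
    t_train = first_step_reaching(train_acc, threshold)
    t_val = first_step_reaching(val_acc, threshold)
    if t_train is None or t_val is None:
        return None
    return t_val - t_train


# Made-up curves for illustration only (NOT results from the paper):
# training accuracy saturates early, validation accuracy much later.
toy_train = [0.20, 0.90, 1.00, 1.00, 1.00, 1.00]
toy_val = [0.00, 0.00, 0.01, 0.02, 0.50, 1.00]
delay = generalization_delay(toy_train, toy_val)
```

A large positive delay is the signature described in the findings; a delay near zero would indicate ordinary learning, where training and validation accuracy rise together.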
Abstract
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training…
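As a rough illustration of what a "small algorithmically generated dataset" can look like, the sketch below builds the full table of a binary operation and splits it into training and validation sets. The choice of modular addition, the modulus 7, the 50% training fraction, and all function names are assumptions made for this example; the abstract itself does not specify them.

```python
import random


def modular_addition_table(p):
    """All equations a + b = c (mod p), as (a, b, c) triples."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def split_table(examples, train_fraction, seed=0):
    """Shuffle the table and split it into disjoint train and validation sets."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical settings: a 7 x 7 table with half of the equations held out.
table = modular_addition_table(7)
train, val = split_table(table, train_fraction=0.5)
```

Because the whole table is finite and known, the training fraction can be varied directly, which is what makes it possible to study generalization as a function of dataset size in this kind of setting.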
Videos
Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) · YouTube
#64 Prof. GARY MARCUS 3.0 [Unplugged] · YouTube
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Data Classification · Machine Learning and Algorithms
