Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power; Yuri Burda; Harri Edwards; Igor Babuschkin; Vedant Misra

arXiv:2201.02177·cs.LG·January 7, 2022·78 cites

TL;DR

This paper studies how neural networks trained on small, algorithmically generated datasets can reach good generalization long after they have overfit the training data, a process the authors call 'grokking', and how the amount of optimization required depends on dataset size.

Contribution

The study introduces 'grokking', a phenomenon in which neural networks come to generalize long after they have overfit the training set, and uses it to examine data efficiency and learning dynamics on small datasets.

Findings

01. Neural networks can 'grok' a pattern in the data, reaching perfect generalization after they have already overfit.

02. Smaller datasets require increasing amounts of optimization before the network generalizes.

03. Generalization can improve from chance level to perfect well past the point of overfitting, going beyond memorization of the training set.
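One way to make the third finding concrete is to measure how many optimization steps separate fitting the training set from generalizing to held-out data. The sketch below is only an illustration: the function names, the 0.99 accuracy threshold, and the idea of reading this gap from logged accuracy curves are assumptions made here, not a procedure taken from the paper.

```python
def first_step_reaching(steps, accuracies, threshold=0.99):
    """Return the first logged step whose accuracy is >= threshold, or None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


def generalization_delay(steps, train_acc, val_acc, threshold=0.99):
    """Steps between fitting the training set and generalizing.

    Hypothetical helper: returns None if either curve never
    reaches the threshold within the logged steps.
    """
    fit_step = first_step_reaching(steps, train_acc, threshold)
    gen_step = first_step_reaching(steps, val_acc, threshold)
    if fit_step is None or gen_step is None:
        return None
    return gen_step - fit_step
```

A large positive delay on a given run would correspond to the pattern described above, where validation accuracy catches up only long after training accuracy has saturated.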

Abstract

In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training…
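As a rough illustration of what a small algorithmically generated dataset can look like, the sketch below enumerates every equation of the form (a + b) mod p = c and splits the full table at random into training and validation sets. Modular addition is chosen because the model listed on this page is trained on it; the function name, the default modulus, the training fraction, and the seed are assumptions made for this example rather than settings confirmed by the abstract.

```python
import random


def make_modular_addition_dataset(p=97, train_fraction=0.5, seed=0):
    """Enumerate all equations (a + b) mod p = c and split them at random.

    p, train_fraction and seed are illustrative defaults (assumptions).
    """
    equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(equations)
    n_train = int(train_fraction * len(equations))
    return equations[:n_train], equations[n_train:]


if __name__ == "__main__":
    train, val = make_modular_addition_dataset(p=97, train_fraction=0.3)
    print(len(train), len(val))
```

Varying `train_fraction` is one simple way to set up the dataset-size comparison the abstract describes, since the full table of p × p equations is fixed and only the share used for training changes.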

Code & Models

Models

BurnyCoder/grokking-modular-addition-transformer (Hugging Face model)

Videos

Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) · YouTube

#64 Prof. GARY MARCUS 3.0 [Unplugged] · YouTube

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Data Classification · Machine Learning and Algorithms
