Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

Mohamad Amin Mohamadi; Zhiyuan Li; Lei Wu; Danica J. Sutherland

arXiv:2407.12332 · cs.LG · July 18, 2024 · 1 citation

TL;DR

This paper gives a theoretical explanation of grokking on modular addition: early in training, models behave like kernel methods and cannot generalize from a small fraction of the data; they generalize only later, after overfitting, once they leave the kernel regime.

Contribution

It introduces a theoretical framework that explains grokking as a consequence of the transition from the kernel regime to the limiting behavior of gradient descent on deep networks.

Findings

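01  In the kernel regime, no permutation-equivariant model achieves small population error on modular addition unless it sees at least a constant fraction of all possible data points.

02  Two-layer quadratic networks that reach zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points.

03  Models leave the kernel regime only after initial overfitting.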

Abstract

We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_{\infty}$ regularization. We further provide empirical evidence that these networks, as well as simple Transformers, leave the kernel regime only after…
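To make the setting in the abstract concrete, the following is a minimal sketch of the modular addition task and of a two-layer network with a quadratic activation. It is not the authors' code. The one-hot input encoding, the parameter names `W` and `V`, the Gaussian initialization, and the choice to measure the $\ell_{\infty}$ norm as the largest absolute parameter entry are all assumptions made for illustration; the paper may define these differently.

```python
import random


def modular_addition_dataset(p):
    """Return all p*p examples ((a, b), (a + b) mod p)."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]


def one_hot_pair(a, b, p):
    """Assumed encoding: concatenated one-hot vectors of a and b (length 2p)."""
    x = [0.0] * (2 * p)
    x[a] = 1.0
    x[p + b] = 1.0
    return x


def init_params(p, width, scale, rng):
    """Hypothetical Gaussian initialization of both layers."""
    W = [[rng.gauss(0.0, scale) for _ in range(2 * p)] for _ in range(width)]
    V = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(p)]
    return W, V


def forward(W, V, x):
    """Two-layer quadratic network: logits = V @ (W @ x) ** 2."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W]
    return [sum(v * h for v, h in zip(row, hidden)) for row in V]


def linf_norm(W, V):
    """Largest absolute parameter entry (an assumed reading of the norm)."""
    return max(abs(v) for M in (W, V) for row in M for v in row)


def train_test_split(data, train_fraction, rng):
    """Shuffle a copy of the data and split it into train and test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    rng = random.Random(0)
    p = 7
    data = modular_addition_dataset(p)
    train, test = train_test_split(data, 0.3, rng)
    W, V = init_params(p, width=16, scale=0.1, rng=rng)
    (a, b), y = train[0]
    logits = forward(W, V, one_hot_pair(a, b, p))
    print(len(train), len(test), len(logits), linf_norm(W, V))
```

The `train_fraction` argument is the quantity the abstract's first result is about: in the kernel regime, a small fraction of the $p^2$ pairs is not enough for small population error, whereas the norm-bounded quadratic networks are shown to need substantially fewer points.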

Taxonomy

Topics: Handwritten Text Recognition Techniques