Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

Yiding Song; Hanming Ye

arXiv:2605.09724·cs.LG·May 12, 2026

TL;DR

This paper presents an information-theoretic framework that explains grokking on modular arithmetic tasks as a competition between memorisation and generalisation speeds, both of which depend on model capacity.

Contribution

It introduces a formal model that links model size to grokking through two measurable timescales, clarifying how capacity shapes learning dynamics.

Findings

01. Grokking occurs near the intersection of the memorisation and generalisation timescales.

02. Larger models memorise faster, consistent with empirical observations.

03. The framework predicts memorisation speed from model capacity and dataset complexity.
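
The first finding can be illustrated with a small sketch that locates where two measured timescale curves cross. It assumes $T_{mem}(P)$ and $T_{gen}(P)$ have already been measured on a shared grid of parameter counts. The function name `crossover_param_count`, the log-log interpolation, and the power-law curves in the example are illustrative assumptions, not the paper's procedure or its data.

```python
import math


def crossover_param_count(params, t_mem, t_gen):
    """Estimate the parameter count P at which t_mem and t_gen cross.

    params, t_mem, t_gen are equal-length sequences of positive numbers,
    with params sorted in increasing order. Returns None if the curves
    do not cross on the grid. (Hypothetical helper, not from the paper.)
    """
    log_p = [math.log(p) for p in params]
    # A crossing of the two curves is a zero of their log-ratio.
    gap = [math.log(m) - math.log(g) for m, g in zip(t_mem, t_gen)]
    for i, g in enumerate(gap):
        if g == 0.0:
            return float(params[i])
        if i + 1 < len(gap) and g * gap[i + 1] < 0:
            # Sign change: interpolate the log-ratio linearly in log P.
            frac = g / (g - gap[i + 1])
            return math.exp(log_p[i] + frac * (log_p[i + 1] - log_p[i]))
    return None


# Synthetic placeholder curves (NOT measurements from the paper), chosen
# so the crossing is known analytically: 1e6 / P == 1e3 / sqrt(P) at P = 1e6.
params = [10 ** (3.5 + k) for k in range(6)]
t_mem = [1e6 / p for p in params]
t_gen = [1e3 / math.sqrt(p) for p in params]
p_star = crossover_param_count(params, t_mem, t_gen)
```

Interpolating the log-ratio rather than the raw values keeps the estimate stable when the timescales span several orders of magnitude.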

Abstract

Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{mem}(P)$ and a generalisation speed $T_{gen}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{mem}(P)$ on random-label data of equivalent complexity and $T_{gen}(P)$ on the…
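
As a rough sketch of the data setup the abstract describes, the snippet below builds a modular arithmetic dataset and a random-label control over the same inputs. Two things are assumptions made here for illustration: using addition as the operation, and reading "equivalent complexity" as the same inputs with uniformly random labels over the same $p$ classes. The paper's exact construction may differ, and the function names are hypothetical.

```python
import random


def modular_addition_dataset(p):
    """All pairs (a, b) with 0 <= a, b < p, labelled by (a + b) mod p.

    Addition is an assumed choice of operation for this sketch.
    """
    inputs = [(a, b) for a in range(p) for b in range(p)]
    labels = [(a + b) % p for a, b in inputs]
    return inputs, labels


def random_label_control(inputs, p, seed=0):
    """Same inputs, but labels drawn uniformly from {0, ..., p - 1}.

    With no structure to exploit, fitting these labels requires
    memorisation. This is an assumed reading of the random-label control.
    """
    rng = random.Random(seed)
    return [rng.randrange(p) for _ in inputs]


inputs, labels = modular_addition_dataset(7)
control_labels = random_label_control(inputs, 7)
```

Both label sets share the same inputs and label space, so any difference in how quickly a model fits them reflects the structure of the task rather than the size of the dataset.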
