Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Yiding Song, Hanming Ye

TL;DR
This paper presents an information-theoretic framework that explains grokking on modular arithmetic tasks as a competition between memorisation and generalisation speeds, both of which depend on model capacity.
Contribution
It introduces a formal model that links model size to grokking through two measurable timescales, a memorisation speed and a generalisation speed, clarifying how capacity shapes learning dynamics.
Findings
Grokking occurs near the intersection of memorisation and generalisation timescales.
Larger models memorise faster, consistent with empirical observations.
The framework predicts memorisation speed based on model capacity and dataset complexity.
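As a rough illustration of the first finding, the sketch below locates the capacity at which two timescale curves cross. The function name `crossover_capacity`, the use of "training steps" as the unit, and every number in the toy data are assumptions made for illustration; none of them come from the paper, and the paper's actual estimation procedure may differ.

```python
import math

def crossover_capacity(capacities, t_mem, t_gen):
    """Return the capacity where t_mem and t_gen intersect.

    capacities: increasing parameter counts (hypothetical grid).
    t_mem, t_gen: positive timescales measured at each capacity.
    Interpolates linearly in log-log space between grid points.
    Returns None if the two curves never cross on the grid.
    """
    # Log-gap between the curves at each grid point; a sign change marks a crossing.
    gaps = [math.log(m) - math.log(g) for m, g in zip(t_mem, t_gen)]
    for i in range(len(capacities) - 1):
        d0, d1 = gaps[i], gaps[i + 1]
        if d0 == 0:
            return capacities[i]
        if d0 * d1 < 0:
            w = d0 / (d0 - d1)  # fraction of the interval before the crossing
            x0, x1 = math.log(capacities[i]), math.log(capacities[i + 1])
            return math.exp(x0 + w * (x1 - x0))
    if gaps and gaps[-1] == 0:
        return capacities[-1]
    return None

# Toy, invented numbers: memorisation gets faster with capacity,
# generalisation time is held flat purely for simplicity.
toy_capacities = [1e3, 1e5]
toy_t_mem = [1e4, 1e2]
toy_t_gen = [1e3, 1e3]
toy_crossover = crossover_capacity(toy_capacities, toy_t_mem, toy_t_gen)
```

With these toy curves the crossing falls at the geometric midpoint of the two capacities, since the log-gap changes sign symmetrically across the interval.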
Abstract
Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed and a generalisation speed, both of which are functions of model parameter count. Adapting the information capacity framework of Morris et al. (2025), we estimate the memorisation speed on random-label data of equivalent complexity and the generalisation speed on the…
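To make the two kinds of data mentioned in the abstract concrete, here is a minimal sketch of a modular arithmetic dataset and a random-label counterpart. It assumes modular addition as the task and reads "equivalent complexity" as "same inputs and same number of label classes"; both choices, and the function names, are assumptions for illustration rather than the authors' construction.

```python
import random

def modular_addition_dataset(p):
    """All pairs (a, b) with label (a + b) mod p. Assumed task: addition."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

def random_label_dataset(p, seed=0):
    """Same inputs as above, but each label is drawn uniformly from 0..p-1.

    There is no rule to learn here, so a model can only fit it by memorising.
    """
    rng = random.Random(seed)  # local generator keeps the data reproducible
    return [((a, b), rng.randrange(p)) for a in range(p) for b in range(p)]

structured = modular_addition_dataset(7)
shuffled = random_label_dataset(7, seed=0)
```

Both datasets contain p² examples over the same inputs, so any difference in how quickly a model fits them reflects the label structure rather than dataset size.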
