What Can Grokking Teach Us About Learning Under Nonstationarity?

Clare Lyle; Gharda Sokar; Razvan Pascanu; Andras Gyorgy

arXiv:2507.20057 · cs.LG · July 29, 2025

TL;DR

This paper explores how feature-learning dynamics, exemplified by grokking, can be leveraged to improve continual learning by overcoming primacy bias through increased effective learning rates.

Contribution

It introduces a simple method to induce feature-learning dynamics via higher effective learning rates, enhancing generalization in various nonstationary learning scenarios.
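The summary does not define "effective learning rate". One common reading, assumed here and not confirmed by this page, applies to scale-invariant parameters (for example, weights followed by a normalization layer): a gradient step with learning rate `lr` at weight norm `c` changes the weight direction exactly as a step with learning rate `lr / c**2` would at unit norm. The sketch below checks that identity numerically. The toy loss and the helper names (`loss`, `grad`, `direction_after_step`) are hypothetical illustrations, not the paper's method.

```python
import numpy as np

def loss(w, x, y):
    # Toy scale-invariant loss: depends on w only through its direction w / ||w||.
    u = w / np.linalg.norm(w)
    return 0.5 * (x @ u - y) ** 2

def grad(w, x, y):
    # Analytic gradient of the toy loss with respect to w.
    n = np.linalg.norm(w)
    u = w / n
    g_u = (x @ u - y) * x               # gradient with respect to the unit direction u
    return (g_u - (g_u @ u) * u) / n    # remove the radial component, scale by 1 / ||w||

def direction_after_step(w, lr, x, y):
    # Take one plain gradient step and return the resulting unit direction.
    w_new = w - lr * grad(w, x, y)
    return w_new / np.linalg.norm(w_new)

x = np.array([1.0, -2.0, 0.5])
y = 0.3
u = np.array([0.6, 0.0, 0.8])   # unit-norm direction
c = 4.0                         # weight norm (assumed value for illustration)
lr = 0.8

big_norm = direction_after_step(c * u, lr, x, y)
unit_norm = direction_after_step(u, lr / c**2, x, y)
print(np.allclose(big_norm, unit_norm))
```

Under this assumption, letting the weight norm grow shrinks the effective learning rate even when the nominal `lr` is fixed, which is one way to read the summary's emphasis on keeping the effective learning rate high.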

Findings

01. Increased effective learning rate accelerates grokking.

02. Method improves generalization in continual learning tasks.

03. Approach benefits reinforcement learning and warm-start training.

Abstract

In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing…
