What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle, Gharda Sokar, Razvan Pascanu, Andras Gyorgy

TL;DR
This paper explores how feature-learning dynamics, exemplified by grokking, can be leveraged to improve continual learning by overcoming primacy bias through increased effective learning rates.
Contribution
It introduces a simple method to induce feature-learning dynamics via higher effective learning rates, enhancing generalization in various nonstationary learning scenarios.
Findings
Increasing the effective learning rate accelerates grokking.
The method improves generalization in continual learning tasks.
The approach also benefits reinforcement learning and warm-start training.
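This summary does not define "effective learning rate". A common definition, assumed here and not confirmed by the text above, applies to scale-invariant weights (for example, a layer followed by normalization): a gradient step with learning rate lr on weights w rotates the direction of w as if the step had been taken at unit norm with learning rate lr / ||w||^2. The sketch below checks that identity on a toy scale-invariant loss. All function names and the toy loss are illustrative and are not taken from the paper.

```python
import math

def norm(w):
    return math.sqrt(sum(x * x for x in w))

def effective_lr(lr, w):
    """Assumed definition: lr / ||w||^2 for scale-invariant weights."""
    return lr / norm(w) ** 2

def loss_grad(w, target):
    """Gradient of the toy scale-invariant loss L(w) = -<w / ||w||, target>."""
    n = norm(w)
    u = [x / n for x in w]
    dot = sum(a * t for a, t in zip(u, target))
    # The gradient is orthogonal to w and shrinks as 1 / ||w||.
    return [-(t - dot * a) / n for a, t in zip(u, target)]

def sgd_direction(w, target, lr):
    """Take one SGD step and return the new unit-norm direction."""
    g = loss_grad(w, target)
    w_new = [x - lr * gx for x, gx in zip(w, g)]
    n = norm(w_new)
    return [x / n for x in w_new]

target = [0.0, 1.0]
w_large = [3.0, 4.0]   # norm 5
w_unit = [0.6, 0.8]    # same direction, norm 1
lr = 0.1

# A step on the large-norm weights moves the direction exactly as far as
# a step on the unit-norm weights taken with the effective learning rate.
dir_large = sgd_direction(w_large, target, lr)
dir_unit = sgd_direction(w_unit, target, effective_lr(lr, w_large))
```

Under this definition, shrinking the weight norm while holding lr fixed raises the effective learning rate, which is one way to read "increased effective learning rates" in the findings above.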
Abstract
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing…
