Grokking vs. Learning: Same Features, Different Encodings
Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, and David S. Berman

TL;DR
This paper compares grokking with steady training in neural networks. It finds that both paths learn the same features but can encode them with very different efficiency, and it identifies a new compressive regime that appears only in steady training.
Contribution
It introduces a detailed comparison of grokking and steady training, discovering a novel compressive regime and analyzing the development of features and compressibility during training.
Findings
Grokking and steady training learn the same features.
Steady training exhibits a unique compressive regime with high compression factors.
Models in grokking follow a straight path in information space.
Abstract
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths - grokking versus ordinary training - lead to fundamental differences in the learned models. To do so we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training in which there emerges a linear trade-off between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25x relative to the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show…
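As a rough illustration of what a "compression factor" measures, the sketch below compresses a single weight matrix by truncated SVD and reports the ratio of original to stored parameters alongside the reconstruction error. Truncated SVD is a stand-in chosen for brevity; it is an assumption, not the compression scheme used in the paper, and the function names, matrix size, and ranks are all hypothetical.

```python
import numpy as np


def low_rank_compress(weight, rank):
    """Return two factors whose product is the best rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Fold the singular values into the left factor so only two arrays are stored.
    return u[:, :rank] * s[:rank], vt[:rank, :]


def compression_factor(weight, rank):
    """Original parameter count divided by the parameters kept after truncation."""
    rows, cols = weight.shape
    return weight.size / (rank * (rows + cols))


def relative_error(weight, rank):
    """Frobenius-norm reconstruction error relative to the original matrix."""
    left, right = low_rank_compress(weight, rank)
    return np.linalg.norm(weight - left @ right) / np.linalg.norm(weight)


# Hypothetical stand-in for a trained layer: a random 64 x 64 matrix.
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 64))

for rank in (32, 8, 4):
    factor = compression_factor(weight, rank)
    error = relative_error(weight, rank)
    print(f"rank={rank:2d}  compression={factor:.1f}x  relative error={error:.3f}")
```

Lower ranks give higher compression factors at the cost of larger reconstruction error. In the setting the abstract describes, the quantity traded against compressibility is model loss rather than reconstruction error, so this only conveys the general shape of such a trade-off.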
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques
Methods: Balanced Selection
