When Data Falls Short: Grokking Below the Critical Threshold
Vaibhav Singh, Eugene Belilovsky, Rahaf Aljundi

TL;DR
This paper shows how knowledge distillation from an already-grokked model can induce and accelerate grokking, particularly in low-data regimes and when adapting to a new distribution.
Contribution
It demonstrates that knowledge distillation from grokked models can enable generalization below the critical data threshold and mitigate forgetting during distribution shifts.
Findings
Knowledge distillation (KD) accelerates grokking in low-data regimes.
Distillation enables generalization when the training data falls below the critical threshold.
KD reduces catastrophic forgetting during continual pretraining.
Abstract
In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the…
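As a rough illustration of the kind of objective the abstract refers to, the sketch below computes a standard temperature-scaled distillation loss for a single example. It is an assumption, not the authors' formulation: the summary does not state the exact loss, so the function names, the temperature, and the mixing weight alpha are all hypothetical. The teacher logits stand in for a model that has already grokked on p1, and the student is the model being trained on p2.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Supervised loss on the ground-truth label (temperature 1)."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term.

    Assumed form: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.
    The T^2 factor is the usual rescaling that keeps the soft term's
    gradient magnitude comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    hard = cross_entropy(student_logits, label)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits for one 3-class example.
teacher = [1.0, 2.0, 3.0]   # stand-in for the grokked teacher's output
student = [3.0, 2.0, 1.0]   # stand-in for the student's current output
loss = distillation_loss(student, teacher, label=2)
```

With alpha set to 1.0 the student is trained only to match the teacher's soft predictions, which is one way to picture learning from a grokked model when labeled data is scarce; the balance actually used in the paper is not given in this summary.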
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
