When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh; Eugene Belilovsky; Rahaf Aljundi

arXiv:2511.04760 · cs.LG · November 10, 2025

TL;DR

This paper shows that knowledge distillation (KD) from a model that has already grokked can induce and accelerate grokking, particularly in low-data regimes and when adapting to a new distribution.

Contribution

The paper demonstrates that distilling from grokked models enables generalization even when the training data falls below the critical threshold, and that it mitigates forgetting under distribution shift.

Findings

01 KD accelerates grokking in low-data regimes.

02 Distillation enables generalization when training data is insufficient.

03 KD reduces catastrophic forgetting during continual pretraining.

Abstract

In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the…
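To make the distillation setup concrete, here is a minimal sketch of a knowledge distillation objective. It assumes the standard temperature-scaled formulation (cross-entropy on the hard labels plus a KL term toward the teacher's softened outputs). The excerpt above does not state the paper's exact loss, so the function names `log_softmax` and `kd_loss` and the parameters `temperature` and `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Generic distillation loss (assumed form, not taken from the paper).

    student_logits, teacher_logits: arrays of shape (batch, num_classes)
    labels: integer class indices of shape (batch,)
    temperature: softening factor applied to both sets of logits (assumption)
    alpha: weight on the hard-label term; (1 - alpha) weights the KD term (assumption)
    """
    n = student_logits.shape[0]

    # Hard-label cross-entropy on the student's unscaled logits.
    ce = -log_softmax(student_logits)[np.arange(n), labels].mean()

    # KL(teacher || student) on temperature-softened distributions.
    log_q_teacher = log_softmax(teacher_logits / temperature)
    log_q_student = log_softmax(student_logits / temperature)
    kl = (np.exp(log_q_teacher) * (log_q_teacher - log_q_student)).sum(axis=-1).mean()

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature**2 * kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 5))  # stand-in for a grokked teacher's logits
    student = rng.normal(size=(4, 5))  # stand-in for the student's logits
    labels = rng.integers(0, 5, size=4)
    print(kd_loss(student, teacher, labels))
```

In the setting the abstract describes, the teacher would be a model that has already grokked on p1 and the student would be trained on the limited p2 data. How the two terms are weighted in the paper's experiments is not specified in this excerpt.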

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis