Critical Data Size of Language Models from a Grokking Perspective
Xuekai Zhu, Yao Fu, Bowen Zhou, Zhouhan Lin

TL;DR
This paper investigates the critical data size needed for language models to transition from memorization to generalization, revealing a size-dependent phase transition and proposing a formal data efficiency hypothesis.
Contribution
It formalizes the phase transition in language model training, introduces the Data Efficiency Hypothesis, and demonstrates how model size influences the critical data threshold for generalization.
Findings
Generalization occurs only after reaching a critical data size.
Larger models require more data to reach the critical size.
Smoother phase transitions are observed at the critical dataset size.
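One way to make the notion of a critical data size concrete is to sweep the training-set size and look for the smallest size at which test accuracy clears a chosen bar. The sketch below is illustrative only: the function name, the 0.9 threshold, and the sweep numbers are assumptions made for this example, not values or procedures taken from the paper.

```python
def estimate_critical_data_size(results, threshold=0.9):
    """Return the smallest data size whose test accuracy meets `threshold`.

    `results` is an iterable of (data_size, test_accuracy) pairs.
    Returns None if no run reaches the threshold.
    The threshold is an assumed operational definition, not the paper's.
    """
    for size, test_acc in sorted(results):
        if test_acc >= threshold:
            return size
    return None


# Hypothetical sweep: (number of training samples, final test accuracy).
sweep = [(1000, 0.52), (2000, 0.55), (4000, 0.61), (8000, 0.93), (16000, 0.95)]
critical = estimate_critical_data_size(sweep)
print(critical)
```

Repeating such a sweep for several model sizes would give one estimate per model, which is the kind of comparison the second finding describes.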
Abstract
We explore the critical data size in language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in language model training dynamics. We develop a grokking configuration that stably reproduces grokking on simple language models by rescaling the initialization and the weight decay. We show that generalization occurs only when language models reach a critical data size. We analyze grokking both sample-wise and model-wise, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal smoother phase transitions occurring at the critical dataset size for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models…
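The abstract mentions rescaling initialization and weight decay as the two knobs of the grokking configuration. A minimal sketch of what those two knobs could look like follows, using only the Python standard library. The scale factor `alpha`, the Gaussian base initialization, the decoupled form of weight decay, and all default values are assumptions made for illustration; the paper's actual configuration may differ.

```python
import random


def init_weights(n, alpha=1.0, seed=0):
    """Draw n weights from a standard scaled Gaussian, then multiply by alpha.

    alpha > 1 enlarges the initial weight norm (assumed rescaling scheme).
    """
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0 / n ** 0.5) for _ in range(n)]
    return [alpha * w for w in base]


def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD step with decoupled weight decay (assumed form)."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a flat weight list."""
    return sum(w * w for w in weights) ** 0.5


w_default = init_weights(64, alpha=1.0)
w_rescaled = init_weights(64, alpha=10.0)

# With zero gradients, weight decay alone shrinks the weight norm.
w_decayed = sgd_step(w_rescaled, [0.0] * 64)
```

In this sketch the rescaled initialization starts with a norm ten times larger than the default, and weight decay then pulls the norm back down over training steps.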
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
