Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
Ziyue Li, Chenrui Fan, Tianyi Zhou

TL;DR
This study investigates the emergence of grokking in large language model pretraining, revealing how models transition from memorization to generalization and proposing low-cost metrics to monitor this process.
Contribution
It is the first to analyze grokking in practical LLM pretraining, especially in mixture-of-experts models, and introduces data pathway metrics for monitoring generalization.
Findings
Grokking occurs in MoE LLM pretraining with asynchronous local stages.
Training data pathways evolve from random to structured, indicating a memorization-to-generalization transition.
Proposed metrics effectively track model generalization without costly evaluations.
Abstract
This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks during thousands of epochs of training on algorithmic data, we focus on a practical setting for LLMs, i.e., one-epoch pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of…
Peer Reviews
Decision: ICLR 2026 Poster
Previous studies of grokking mostly used small models trained on synthetic algorithmic tasks, but this work examines a 7‑billion‑parameter mixture‑of‑experts (MoE) model (OLMoE) and shows that grokking still appears in one‑epoch pretraining. The observation that grokking in LLMs is local and asynchronous broadens our understanding of grokking at scale. The metrics they define, pathway edit distance and per-sample pathway consistency, rely only on pretraining data and internal activatio…
Converting the top‑k experts per layer into comma‑separated strings and computing Levenshtein distance between them is ad hoc, since edit distance is sensitive to sequence length and to arbitrary thresholding. This distance can also decrease simply because of stronger load‑balancing or saturated routers. The theoretical bound assumes fixed routing and an NTK regime for a one‑layer MoE, while in practice OLMoE updates routing and experts jointly across many layers for trillions of tokens.
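To make the construction this review criticizes concrete, here is a minimal Python sketch of a pathway edit distance. It is an illustration under stated assumptions, not the paper's implementation: the names `encode_pathway`, `levenshtein`, and `pathway_edit_distance` are hypothetical, expert ids within a layer are sorted before encoding, and the distance is computed over per-layer tokens rather than over individual characters. The excerpts on this page do not say which of these choices the authors made.

```python
from typing import List, Sequence

# A pathway is assumed to be one group of top-k expert ids per MoE layer.
Pathway = List[Sequence[int]]


def encode_pathway(pathway: Pathway) -> str:
    """Encode a pathway as a comma-separated string, one token per layer.

    Assumption: expert ids inside a layer are sorted, so routing order
    within the top-k set is ignored.
    """
    return ",".join("-".join(str(e) for e in sorted(layer)) for layer in pathway)


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming Levenshtein distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def pathway_edit_distance(p: Pathway, q: Pathway, normalize: bool = False) -> float:
    """Edit distance between two pathways at the layer-token level.

    Assumption: with normalize=True the distance is divided by the longer
    pathway length, one simple way to reduce the length sensitivity
    raised in the review.
    """
    tokens_p = encode_pathway(p).split(",") if p else []
    tokens_q = encode_pathway(q).split(",") if q else []
    dist = levenshtein(tokens_p, tokens_q)
    if normalize:
        return dist / max(len(tokens_p), len(tokens_q), 1)
    return dist


if __name__ == "__main__":
    # Two toy 3-layer pathways with top-2 routing; they differ only in layer 2.
    sample_a = [(3, 17), (5, 8), (1, 2)]
    sample_b = [(17, 3), (5, 9), (1, 2)]
    print(encode_pathway(sample_a))
    print(pathway_edit_distance(sample_a, sample_b))
    print(pathway_edit_distance(sample_a, sample_b, normalize=True))
```

Because every sample passes through the same number of layers, the token-level variant compares sequences of equal length. A character-level variant would additionally depend on how many digits each expert id has, which is one way the length sensitivity noted above can arise.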
* Zero-cost, pathway-based indicators derived from MoE routing dynamics: The paper proposes two metrics, sample-to-sample pathway similarity and across-layer pathway consistency, that can track the rise of downstream generalization without instruction tuning or benchmark evaluations. LLM evaluation is expensive. Leveraging internal routing information to estimate generalization progress directly during pretraining is highly cost-effective.
* Discovery of “local grokking” under data heterogenei…
* Limited empirical scope (single model): I understand there are no other publicly available MoE checkpoints, but the paper's results and discussion are tailored to the specific OLMoE 7B model. Ideally, robustness should be assessed across a broad range of choices in optimization, model design, and training data, such as learning rate schedules, model scales, expert capacity and number, and data mixtures. Otherwise, the paper's results and findings may be model-specific or biased by the model u…
1. The authors use a realistic setting: one-epoch training, heterogeneous web-scale data, public 7B MoE checkpoints, and diverse domains.
2. The authors provide evidence of asynchronous memorization and delayed generalization, including matched training/test groups and domain-dependent lags.
3. The authors use public checkpoints and datasets with explicit data-contamination filtering, which supports reproducibility.
1. The paper mentions “virtual pathways” for dense models as future work; however, the core claim (test-free generalization monitoring) would be much stronger with an experiment on a small-scale dense model.
2. Consider replacing “zero-cost” with “near-zero-cost” or “cheap to compute” to avoid overstatement.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science
