First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation
Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

TL;DR
This paper presents a quantitative law predicting grokking delay under AdamW, validated across multiple architectures and tasks, with causal interventions confirming the theoretical insights.
Contribution
It introduces the first closed-form prediction of grokking delay as a first-passage time, incorporating AdamW corrections and causal validation.
Findings
The calibrated law predicts grokking delay with 17.7% MAPE on 26 held-out runs spanning a 41x delay range.
The law generalizes to MLPs with 18.0% MAPE (N=34) and degrades to 23.3% on cross-task extension (N=46).
Causal interventions confirm the importance of norm and angular reachability in grokking.
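The error figures above are mean absolute percentage errors (MAPE). As a minimal sketch of how such a figure is computed, assuming paired lists of predicted and observed delays (the function name `mape` and the sample numbers are illustrative, not taken from the paper):

```python
def mape(predicted, observed):
    """Mean absolute percentage error, returned in percent.

    Assumes every observed value is nonzero, which holds for
    grokking delays measured in optimizer steps.
    """
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Average the relative error |prediction - observation| / |observation|.
    total = sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed))
    return 100.0 * total / len(observed)


if __name__ == "__main__":
    # Hypothetical delays in steps, for illustration only.
    print(mape([110.0, 90.0], [100.0, 100.0]))
```

Because each error is normalized by the observed delay, MAPE weights short and long delays equally, which is relevant when the delays span a range as wide as 41x.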
Abstract
We give the first quantitative prediction of grokking delay under AdamW. Treating the delay as a first-passage time, we derive a closed-form law T_grok - T_mem = (1 / (2 kappa_LL eta lambda)) log(V_mem / V_star), where V_t = ||theta_t||^2 is the squared parameter norm, V_star is an architecture-dependent threshold, and kappa_LL absorbs the AdamW correction to the clean-SGD contraction rate 2 eta lambda. Calibrating (kappa_LL, V_star) on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which V_star / V_mem stays comparatively stable within architecture (CV about 14% on the 1L transformer). First-passage of V_t is necessary but not sufficient. A quantile-margin theorem establishes…
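The closed-form law stated in the abstract can be evaluated directly. The sketch below is not the authors' code: the function name, argument names, and all numeric values are assumptions chosen for illustration, and it reads the prefactor as 1 / (2 kappa_LL eta lambda), so that kappa_LL = 1 corresponds to the clean-SGD contraction rate 2 eta lambda.

```python
import math


def predicted_grokking_delay(eta, lam, v_mem, v_star, kappa_ll=1.0):
    """Evaluate T_grok - T_mem = log(V_mem / V_star) / (2 * kappa_LL * eta * lambda).

    eta      : learning rate
    lam      : weight-decay coefficient (lambda)
    v_mem    : squared parameter norm at memorization, V_mem
    v_star   : architecture-dependent threshold, V_star
    kappa_ll : calibrated AdamW correction (1.0 is an assumed placeholder)
    """
    if eta <= 0 or lam <= 0 or kappa_ll <= 0:
        raise ValueError("eta, lam and kappa_ll must be positive")
    if not (v_mem > v_star > 0):
        raise ValueError("the law assumes V_mem > V_star > 0")
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    steps = predicted_grokking_delay(eta=1e-3, lam=1.0, v_mem=400.0, v_star=100.0)
    print(f"predicted delay: {steps:.0f} steps")
```

Under this form the predicted delay scales inversely with the product eta * lambda and only logarithmically with the norm ratio V_mem / V_star, so doubling the weight decay halves the predicted delay, while doubling V_mem adds a fixed log(2) / (2 kappa_LL eta lambda) steps.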
