TL;DR
This paper studies how regularization shapes grokking. It shows that regularization, whether explicit or implicit, can induce delayed generalization in neural networks, and it examines the roles of over-parameterization and data selection.
Contribution
It demonstrates that regularizing toward a specific property can induce grokking, and that over-parameterization by adding depth enables grokking without explicit regularization. It also questions the use of the L2 norm as a proxy for generalization.
Findings
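Small but non-zero regularization of a property P (e.g., sparse or low-rank weights) induces grokking when a model with that property generalizes.
Over-parameterization by adding depth enables grokking or ungrokking without explicit regularization.
The L2 norm is not a reliable proxy for generalization.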
Abstract
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property P (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of P (e.g., L1 or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the L2 norm is not a reliable proxy for generalization when the…
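To make "a small but non-zero regularization of P" concrete, here is a minimal sketch of the two penalties the abstract names, attached to a plain least-squares loss. It is an illustration only, not the paper's setup: the linear model, the function names, and the parameters `beta` and `lr` are all assumptions introduced here, and the snippet does not by itself reproduce grokking.

```python
import numpy as np

def l1_penalty(w):
    """Sum of absolute values; pushes weights toward sparsity."""
    return np.abs(w).sum()

def nuclear_penalty(W):
    """Sum of singular values; pushes a weight matrix toward low rank."""
    return np.linalg.svd(W, compute_uv=False).sum()

def regularized_loss(X, y, w, beta, penalty):
    """Half mean squared error plus beta * penalty(w).

    beta is the (assumed) regularization strength; the abstract only says
    it should be small but non-zero.
    """
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2) + beta * penalty(w)

def l1_subgradient_step(X, y, w, lr, beta):
    """One subgradient-descent step on the L1-regularized loss above."""
    grad = X.T @ (X @ w - y) / len(y)          # gradient of the data term
    return w - lr * (grad + beta * np.sign(w))  # sign(w) is a subgradient of |w|
```

Swapping `l1_penalty` for `nuclear_penalty` (with a matrix-valued `w`) changes which property the optimizer is nudged toward while leaving the data term untouched.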