Understanding Grokking Through A Robustness Viewpoint
Zhiquan Tan, Weiran Huang

TL;DR
This paper investigates the grokking phenomenon in neural networks from a robustness viewpoint. It links grokking to the $l_2$ weight norm, proposes perturbation-based methods to accelerate it and metrics to predict it, and attributes the speed-up to learning basic group properties such as the commutative law.
Contribution
It introduces a robustness perspective to understand grokking, links $l_2$ weight norm to grokking, and proposes new metrics for predicting and speeding up the phenomenon.
Findings
$l_2$ weight norm is a sufficient condition for grokking
Perturbation-based methods can accelerate generalization
New robustness and information-theoretic metrics correlate with grokking
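The link between weight norm and robustness in the findings above can be illustrated on a toy case. The sketch below is not the paper's metric or method. It assumes a plain linear model and Gaussian input noise, and the names `l2_weight_norm`, `perturbation_gap`, and `make_linear` are hypothetical. For a linear map, the output change under a perturbation $\delta$ is bounded by $\|w\|_2 \|\delta\|_2$, so a smaller weight norm implies a smaller sensitivity to noise.

```python
import math
import random


def l2_weight_norm(params):
    """Return the l2 norm of all parameters, treated as one flat vector."""
    return math.sqrt(sum(w * w for layer in params for w in layer))


def perturbation_gap(predict, inputs, sigma, trials=20, seed=0):
    """Mean absolute change in predict(x) under Gaussian input noise of scale sigma."""
    rng = random.Random(seed)  # fixed seed so both models see the same noise
    total = 0.0
    count = 0
    for x in inputs:
        base = predict(x)
        for _ in range(trials):
            noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
            total += abs(predict(noisy) - base)
            count += 1
    return total / count


def make_linear(weights):
    """Toy stand-in for a network (assumption): f(x) = w . x."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
small = [0.5, -0.5]   # low-norm weights
large = [5.0, -5.0]   # same direction, ten times the norm

gap_small = perturbation_gap(make_linear(small), inputs, sigma=0.1)
gap_large = perturbation_gap(make_linear(large), inputs, sigma=0.1)

print("norms:", l2_weight_norm([small]), l2_weight_norm([large]))
print("gaps: ", gap_small, gap_large)
```

In this toy setting the higher-norm model is more sensitive to the same noise. Whether and how this carries over to the networks studied in the paper is what the paper itself addresses.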
Abstract
Recently, an interesting phenomenon called grokking has gained much attention, where generalization occurs long after the models have initially overfitted the training data. We try to understand this seemingly strange phenomenon through the robustness of the neural network. From a robustness perspective, we show that the popular weight norm (metric) of the neural network is actually a sufficient condition for grokking. Based on the previous observations, we propose perturbation-based methods to speed up the generalization process. In addition, we examine the standard training process on the modulo addition dataset and find that it hardly learns other basic group operations before grokking, for example, the commutative law. Interestingly, the speed-up of generalization when using our proposed method can be explained by learning the commutative law, a necessary condition when the…
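One way to probe whether a model has learned the commutative law on modulo addition is to compare its predictions on $(a, b)$ and $(b, a)$. The sketch below is an illustration only, not the paper's metric: `commutativity_rate` and the two stand-in predictors are hypothetical, and the "memorizer" simply assumes a model that fits the ordered pairs with $a \le b$ and outputs 0 elsewhere.

```python
def commutativity_rate(predict, p):
    """Fraction of pairs a < b for which predict(a, b) == predict(b, a)."""
    agree = 0
    total = 0
    for a in range(p):
        for b in range(a + 1, p):
            total += 1
            if predict(a, b) == predict(b, a):
                agree += 1
    return agree / total


p = 7  # small prime modulus, chosen only for the example


def exact(a, b):
    """Ground-truth modulo addition, which is commutative."""
    return (a + b) % p


def memorizer(a, b):
    """Hypothetical overfitted model: correct on pairs with a <= b, 0 otherwise."""
    return (a + b) % p if a <= b else 0


rate_exact = commutativity_rate(exact, p)
rate_memorizer = commutativity_rate(memorizer, p)
print(rate_exact, rate_memorizer)
```

The memorizer fits every pair it was assumed to see yet agrees on swapped inputs only when the true sum happens to be 0 modulo `p`, so training accuracy alone does not reveal whether commutativity has been learned.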
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
