Understanding Grokking Through A Robustness Viewpoint

Zhiquan Tan; Weiran Huang

arXiv:2311.06597·cs.LG·February 5, 2024·2 cites

Open Access

TL;DR

This paper investigates the grokking phenomenon in neural networks from a robustness perspective, links it to the $l_2$ weight norm, and proposes perturbation-based methods to accelerate grokking and new metrics to predict it. It ties the speed-up to learning basic group properties such as the commutative law.

Contribution

It introduces a robustness perspective for understanding grokking, links the $l_2$ weight norm to grokking, and proposes new metrics for predicting the phenomenon along with perturbation-based methods for speeding it up.

Findings

01. $l_2$ weight norm is a sufficient condition for grokking

02. Perturbation-based methods can accelerate generalization

03. New robustness and information-theoretic metrics correlate with grokking

Abstract

Recently, an interesting phenomenon called grokking has gained much attention, where generalization occurs long after the models have initially overfitted the training data. We try to understand this seemingly strange phenomenon through the robustness of the neural network. From a robustness perspective, we show that the popular $l_{2}$ weight norm (metric) of the neural network is actually a sufficient condition for grokking. Based on the previous observations, we propose perturbation-based methods to speed up the generalization process. In addition, we examine the standard training process on the modulo addition dataset and find that it hardly learns other basic group operations before grokking, for example, the commutative law. Interestingly, the speed-up of generalization when using our proposed method can be explained by learning the commutative law, a necessary condition when the…
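The abstract's two central quantities, the $l_2$ weight norm and whether a model has learned the commutative law on modulo addition, can be made concrete with a small sketch. The code below is illustrative only and is not the paper's implementation or its exact metrics: the function names (`modular_addition_dataset`, `commutativity_score`, `l2_weight_norm`) are hypothetical, the model is assumed to be any callable `predict(a, b)` returning a residue, and the weights are assumed to be given as plain lists of floats.

```python
import math


def modular_addition_dataset(p):
    """Return every pair (a, b) in Z_p x Z_p with its label (a + b) mod p."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]


def commutativity_score(predict, p):
    """Fraction of pairs a < b on which predict(a, b) == predict(b, a).

    A score of 1.0 means the predictor is symmetric in its two inputs,
    which is necessary (but not sufficient) for computing modulo addition.
    """
    if p < 2:
        raise ValueError("p must be at least 2 to form a pair with a < b")
    pairs = [(a, b) for a in range(p) for b in range(a + 1, p)]
    agree = sum(1 for a, b in pairs if predict(a, b) == predict(b, a))
    return agree / len(pairs)


def l2_weight_norm(layers):
    """l2 norm of all parameters, given as a list of flat lists of floats."""
    return math.sqrt(sum(w * w for layer in layers for w in layer))
```

For instance, `commutativity_score(lambda a, b: (a + b) % 7, 7)` evaluates to 1.0 for the exact operation, while a predictor that ignores its second argument, such as `lambda a, b: a`, scores 0.0. Tracking such a score alongside `l2_weight_norm` during training is one way to observe the behavior the abstract describes, under the assumptions stated above.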

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings