Deep Grokking: Would Deep Neural Networks Generalize Better?
Simin Fan, Razvan Pascanu, Martin Jaggi

TL;DR
This paper investigates the grokking phenomenon in deep neural networks, revealing that deeper models are more susceptible to grokking, exhibit multi-stage generalization, and that feature rank dynamics can predict generalization better than weight norms.
Contribution
It is the first study to explore grokking in deep networks, linking feature rank changes to generalization and revealing multi-stage phenomena in deep models.
Findings
Deep networks are more susceptible to grokking than shallow ones.
Multi-stage generalization with secondary accuracy surges occurs in deep models.
Feature rank dynamics correlate with phase transitions in generalization.
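To make the feature-rank finding concrete, here is a minimal sketch of one way to measure the rank of a layer's features. The summary above does not say how the paper defines feature rank, so this uses an assumed definition: the number of singular values above a relative tolerance. The function name `feature_rank`, the `rel_tol` parameter, and the synthetic matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def feature_rank(features: np.ndarray, rel_tol: float = 1e-3) -> int:
    """Numerical rank of a (num_samples, feature_dim) activation matrix.

    Assumed definition (not from the paper): count the singular values
    that exceed rel_tol times the largest singular value.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    if singular_values.size == 0 or singular_values[0] == 0:
        return 0
    return int(np.sum(singular_values > rel_tol * singular_values[0]))

# Synthetic stand-ins for hidden activations (illustrative only).
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(64, 3)) @ rng.normal(size=(3, 32))  # rank 3 by construction
full_rank = rng.normal(size=(64, 32))                           # generic dense matrix

print(feature_rank(low_rank), feature_rank(full_rank))
```

Logging this value per layer at each evaluation step would give a rank curve that can be compared against the test-accuracy curve.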
Abstract
Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise in the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While existing research primarily focuses on shallow networks such as 2-layer MLPs and 1-layer Transformers, we explore grokking on deep networks (e.g., a 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover…
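The abstract's definition of grokking (test accuracy rising long after the training set is fit) can be turned into a simple measurement. The sketch below is illustrative only: the 0.99 threshold, the function names, and the toy accuracy curves are assumptions, not values or procedures from the paper.

```python
import numpy as np

def first_step_at_or_above(accuracy, threshold):
    """Index of the first logged step whose accuracy reaches the threshold, or None."""
    acc = np.asarray(accuracy, dtype=float)
    hits = np.flatnonzero(acc >= threshold)
    return int(hits[0]) if hits.size else None

def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Logged steps between fitting the training set and generalizing.

    Returns None if either curve never reaches the (assumed) threshold.
    """
    fit_step = first_step_at_or_above(train_acc, threshold)
    gen_step = first_step_at_or_above(test_acc, threshold)
    if fit_step is None or gen_step is None:
        return None
    return gen_step - fit_step

# Toy curves, one entry per evaluation step (made up for illustration).
train_curve = [0.2, 0.7, 1.0, 1.0, 1.0, 1.0, 1.0]
test_curve = [0.1, 0.2, 0.2, 0.3, 0.3, 0.95, 1.0]

print(grokking_delay(train_curve, test_curve))
```

A larger delay under this measure corresponds to a longer overfitting phase before the test-accuracy surge. A single threshold does not capture the multi-stage behavior described above, which would need more than one crossing point.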
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
