To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov

TL;DR
This paper investigates how neural networks can memorize corrupted data while still generalizing well, and how regularization techniques help promote generalization by suppressing memorization, with insights into training dynamics and neuron behavior.
Contribution
It provides an interpretable model demonstrating the coexistence of memorization and generalization, and explains how regularization methods influence this balance in neural networks.
Findings
Networks can memorize corrupted labels and still generalize.
Pruning memorizing neurons improves accuracy on uncorrupted data.
Regularization methods promote generalization by suppressing memorization.
Abstract
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where a fraction of the labels are corrupted (i.e., some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels and achieve generalization at the same time; (ii) the memorizing neurons can be identified and…
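To make the setup in the abstract concrete, here is a minimal sketch of a modular arithmetic dataset with corrupted labels. It is an illustration under stated assumptions, not the authors' code: the function name `make_corrupted_modular_dataset`, the choice of addition as the operation, the modulus `p=97`, and the default corruption fraction are all hypothetical, and the sketch treats the full table of pairs as the training set rather than modeling a train/test split.

```python
import random

def make_corrupted_modular_dataset(p=97, corrupt_fraction=0.2, seed=0):
    """Build every (a, b) pair with label (a + b) mod p, then corrupt a fraction of labels.

    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Full table of inputs and their correct labels.
    pairs = [(a, b) for a in range(p) for b in range(p)]
    clean = [(a + b) % p for a, b in pairs]

    # Choose which examples receive an incorrect label.
    n_corrupt = int(round(corrupt_fraction * len(pairs)))
    corrupt_idx = set(rng.sample(range(len(pairs)), n_corrupt))

    labels = []
    for i, y in enumerate(clean):
        if i in corrupt_idx:
            # A nonzero offset mod p guarantees the new label differs from the true one.
            labels.append((y + rng.randrange(1, p)) % p)
        else:
            labels.append(y)
    return pairs, clean, labels, corrupt_idx

if __name__ == "__main__":
    pairs, clean, labels, corrupt_idx = make_corrupted_modular_dataset(p=11, corrupt_fraction=0.25)
    print(len(pairs), "examples,", len(corrupt_idx), "with corrupted labels")
```

Keeping the clean labels alongside the corrupted ones makes it possible to measure accuracy on corrupted and uncorrupted examples separately, which is the distinction the findings above rely on.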
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam
