To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

Darshil Doshi; Aritra Das; Tianyu He; Andrey Gromov

arXiv:2310.13061 · cs.LG · March 6, 2024 · 1 citation

TL;DR

This paper investigates how neural networks can memorize corrupted data while still generalizing well, and how regularization techniques help promote generalization by suppressing memorization, with insights into training dynamics and neuron behavior.

Contribution

It provides an interpretable model demonstrating the coexistence of memorization and generalization, and explains how regularization methods influence this balance in neural networks.

Findings

01 Networks can memorize corrupted labels and still generalize.

02 Pruning the memorizing neurons improves accuracy on uncorrupted data.

03 Regularization methods promote generalization by suppressing memorization.

Abstract

Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where $\xi \cdot 100\%$ of labels are corrupted (i.e., some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels and achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and…
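
The corrupted-label setup described in the abstract can be illustrated with a minimal sketch. It assumes modular addition as the task and replaces a fraction `xi` of training labels with a different, randomly chosen residue. The function name `make_corrupted_modular_dataset`, the default modulus `p=97`, the train/test split, and the exact corruption scheme are illustrative assumptions, not details taken from the paper or its repository.

```python
import random


def make_corrupted_modular_dataset(p=97, train_fraction=0.5, xi=0.2, seed=0):
    """Build (a, b) -> (a + b) mod p and corrupt a fraction xi of training labels.

    Hypothetical helper for illustration; the paper's own data pipeline may differ.
    Returns:
        train: list of (a, b, label, is_corrupted)
        test:  list of (a, b, label) with all labels correct
    """
    rng = random.Random(seed)

    # Enumerate every input pair of the modular addition table and shuffle it.
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng.shuffle(pairs)

    n_train = int(train_fraction * len(pairs))
    train_pairs, test_pairs = pairs[:n_train], pairs[n_train:]

    # Choose which training examples receive a wrong label.
    n_corrupt = round(xi * n_train)
    corrupt_idx = set(rng.sample(range(n_train), n_corrupt))

    train = []
    for i, (a, b) in enumerate(train_pairs):
        true_label = (a + b) % p
        if i in corrupt_idx:
            # Assumption: shift by a nonzero offset so the label is always wrong.
            label = (true_label + rng.randrange(1, p)) % p
        else:
            label = true_label
        train.append((a, b, label, i in corrupt_idx))

    # The held-out set keeps correct labels, so it measures generalization.
    test = [(a, b, (a + b) % p) for a, b in test_pairs]
    return train, test
```

Under this sketch, training accuracy on the corrupted subset would indicate memorization, while accuracy on the clean held-out set would indicate whether the network has learned the underlying rule.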

Code & Models

Repositories

d-doshi/Grokking (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam