Grokking in Linear Models for Logistic Regression
Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan

TL;DR
This paper demonstrates that grokking, or delayed generalization, can occur in simple linear models with logistic loss, driven by gradient descent dynamics and data asymmetries, without the need for deep neural networks.
Contribution
It provides a theoretical and empirical analysis of grokking in linear models, revealing the underlying phases and data conditions that lead to delayed generalization.
Findings
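Grokking occurs in linear models when the test data are concentrated around the margin or generated adversarially via PGD attacks, but not when the test data are drawn from the training distribution.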
The implicit bias of gradient descent induces a three-phase learning process.
Grokking can happen even without depth or complex representations.
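For orientation, the setting behind these findings can be written in standard notation. The symbols below (w, x_i, y_i, the step size \eta, and the hard-margin solution \hat{w}) are assumed here for illustration and are not taken from the paper. The last line is the well-known implicit-bias statement for gradient descent with a sufficiently small step size on linearly separable data, not a result specific to this paper.

```latex
% Logistic loss of a linear model through the origin, labels y_i in {-1, +1}
\mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^\top x_i}\right)

% Gradient descent with step size \eta
w_{t+1} = w_t - \eta \, \nabla \mathcal{L}(w_t)

% Hard-margin (max-margin) separator through the origin
\hat{w} = \arg\min_{w} \|w\|_2^2 \quad \text{s.t.} \quad y_i w^\top x_i \ge 1 \;\; \forall i

% Implicit bias: the iterates grow in norm and converge in direction
\lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2}
```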
Abstract
Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process: population-dominated, support-vector-dominated unlearning, and support-vector-dominated…
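The setup described in the abstract can be sketched as a small experiment: a linear model through the origin, trained by full-batch gradient descent on the logistic loss, and evaluated on test points concentrated near the separator. This is a minimal sketch under assumed choices, not the authors' code. The 2-D synthetic data, the helper names (`make_data`, `accuracy`, `train`), the gap ranges, the step size, and the step count are all hypothetical.

```python
import numpy as np

def make_data(n, rng, gap_low, gap_high):
    """Points labelled by the sign of x[0], at distance gap_low..gap_high from the line x[0] = 0."""
    y = rng.choice([-1.0, 1.0], size=n)
    dist = rng.uniform(gap_low, gap_high, size=n)  # distance to the separator
    ortho = rng.normal(size=n)                     # coordinate along the separator
    return np.column_stack([y * dist, ortho]), y

def accuracy(w, X, y):
    """Fraction of points with strictly positive margin y * <w, x>."""
    return float(np.mean(y * (X @ w) > 0))

def train(X, y, X_test, y_test, lr=0.5, steps=5000, log_every=500):
    """Full-batch gradient descent on the mean logistic loss, starting from w = 0."""
    w = np.zeros(X.shape[1])
    history = []  # rows: (step, train accuracy, test accuracy, ||w||)
    for t in range(steps + 1):
        if t % log_every == 0:
            history.append((t, accuracy(w, X, y), accuracy(w, X_test, y_test),
                            float(np.linalg.norm(w))))
        if t == steps:
            break
        margins = y * (X @ w)
        sig = 0.5 * (1.0 - np.tanh(0.5 * margins))     # sigmoid(-margins), numerically stable
        grad = -(X * (y * sig)[:, None]).mean(axis=0)  # gradient of mean log(1 + exp(-margin))
        w -= lr * grad
    return w, history

rng = np.random.default_rng(0)
X_train, y_train = make_data(50, rng, 1.0, 3.0)       # training set, well separated (assumed)
X_margin, y_margin = make_data(500, rng, 0.05, 0.3)   # test set concentrated near the separator (assumed)
w, history = train(X_train, y_train, X_margin, y_margin)
for step, train_acc, test_acc, norm in history:
    print(f"step={step:5d}  train_acc={train_acc:.3f}  margin_test_acc={test_acc:.3f}  |w|={norm:.3f}")
```

Comparing the `train_acc` and `margin_test_acc` columns over the logged steps shows how far accuracy near the margin lags behind training accuracy in this toy setting. Whether a clear delay appears depends on the assumed data and hyperparameters.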
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
