Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data

Zhiwei Xu; Yutong Wang; Spencer Frei; Gal Vardi; Wei Hu

arXiv:2310.02541 · cs.LG · October 5, 2023

Open Access 3 Reviews

TL;DR

This paper provides the first theoretical proof of benign overfitting and grokking phenomena in two-layer ReLU neural networks trained with gradient descent on XOR cluster data with noisy labels, revealing the feature learning process.

Contribution

The paper demonstrates that both benign overfitting and grokking occur in neural networks trained on data that is not linearly separable, and it analyzes in detail how features are learned over the course of training.

Findings

01

After the first GD step, the network fits the noisy training labels perfectly but achieves near-random test accuracy.

02

With further training, the network reaches near-optimal test accuracy while still fitting the noisy labels.

03

Feature learning transitions from linear to generalizable representations.

Abstract

Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting -- achieving a perfect fit to training data with near-random performance on test data -- before transitioning ("grokking") to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a…
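The data setting named in the abstract can be made concrete with a small sampler. The following is a minimal sketch, not the authors' code: it assumes that "XOR cluster data" means four Gaussian clusters centered at ±μ1 (clean label +1) and ±μ2 (clean label −1) with orthogonal means, and that each training label is flipped independently with a fixed probability. The function name `make_xor_cluster_data`, its parameters, the unit-variance noise, and the choice of coordinate-axis means are all hypothetical choices made for illustration.

```python
import random


def make_xor_cluster_data(n, p, mu_norm, flip_prob, seed=0):
    """Sample n points in R^p from four Gaussian clusters with XOR-style labels.

    Assumed layout (illustrative only): clusters at +mu1 and -mu1 carry clean
    label +1, clusters at +mu2 and -mu2 carry clean label -1, and mu1 is
    orthogonal to mu2. No single hyperplane separates the two classes.
    """
    if p < 2:
        raise ValueError("need p >= 2 for two orthogonal cluster means")
    rng = random.Random(seed)

    # Orthogonal means placed on the first two coordinate axes (an assumption).
    mu1 = [mu_norm if j == 0 else 0.0 for j in range(p)]
    mu2 = [mu_norm if j == 1 else 0.0 for j in range(p)]
    # Each entry: (mean vector, sign applied to the mean, clean label).
    clusters = [(mu1, 1, 1), (mu1, -1, 1), (mu2, 1, -1), (mu2, -1, -1)]

    xs, clean, noisy = [], [], []
    for _ in range(n):
        mean, sign, label = rng.choice(clusters)
        # Cluster center plus standard Gaussian noise in every coordinate.
        x = [sign * m + rng.gauss(0.0, 1.0) for m in mean]
        # Flip the observed label with probability flip_prob.
        flipped = rng.random() < flip_prob
        xs.append(x)
        clean.append(label)
        noisy.append(-label if flipped else label)
    return xs, clean, noisy


if __name__ == "__main__":
    xs, clean, noisy = make_xor_cluster_data(n=200, p=50, mu_norm=5.0, flip_prob=0.1)
    n_flipped = sum(c != y for c, y in zip(clean, noisy))
    print(f"{n_flipped} of {len(xs)} training labels were flipped")
```

Under these assumptions, a model that reaches 100% training accuracy on `noisy` has necessarily memorized the flipped labels, which is the overfitting regime the abstract describes; test accuracy would be measured against freshly sampled points and their `clean` labels.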

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

### Notable results on benign overfitting of neural networks beyond linearly separable data

As far as I understand, proving benign overfitting for neural networks involves several difficulties due to their nonlinearity. In particular, I agree that showing the superiority of neural networks over linear methods in learning a nonlinear target function has been largely open in this context. I think XOR cluster data is a good starting point for this problem, and this paper proves benign overfitting under mode

Weaknesses

### Justification of the small initialization

In my understanding, it is crucial to take a small initialization scale compared to the step size $\alpha$ to obtain perfect overfitting at the first gradient step. I think this is acceptable as theory, but it would be better to justify such a small initialization. I also want to know what happens if the initialization scale is much larger than that used in Figure 3.

### Large signal-to-noise ratio

Compared to [Ji & Telgarsky (2019)](https://arxiv

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

* The paper offers a new theoretical examination of benign overfitting and "grokking" in two-layer ReLU neural networks. It focuses on XOR cluster data with noisy labels, giving a detailed exploration and proofs related to these phenomena. The authors use existing concepts and new theories to better explain the behavior of neural networks with noisy training data.
* The paper is structured and clear, effectively communicating the authors’ work and results. It has a logical organization that mak

Weaknesses

* The assumption made in A1 seems to contradict common understanding. Generally, increasing the number of samples, even with limited noisy labels, tends to enhance the generalization capability of neural networks. However, in Assumption A1, having a larger number of training samples seems to adversely affect the model, as indicated by its presence on the right-hand side of the inequality. This aspect might require further clarification or justification within the context of the study.
* The mec

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

This paper is the first to study the combination of benign overfitting and the grokking phenomenon in neural networks.

Weaknesses

My main concern about this paper lies in its assumptions. Combining assumptions A1 and A2, we can obtain $p\geq C^4 n^{5.02}$. This is an extremely high-dimensional setting.
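To give a sense of the scale behind the reviewer's bound $p\geq C^4 n^{5.02}$, the short sketch below evaluates the right-hand side for a few sample sizes. The constant $C$ is not specified in the text above, so `C=1.0` is only a placeholder assumption, and the helper name `min_dimension` is hypothetical.

```python
def min_dimension(n, C=1.0):
    """Right-hand side of the reviewer's bound p >= C^4 * n^5.02.

    C is an unspecified constant in the source; the default of 1.0 is a
    placeholder for illustration, not a value taken from the paper.
    """
    return C**4 * n**5.02


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n = {n:>4}: p must be at least {min_dimension(n):.3e}")
```

Even with the placeholder constant, the required dimension grows faster than the fifth power of the sample size, which is the point of the reviewer's concern.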

Taxonomy

Topics: Advanced Neural Network Applications · Face and Expression Recognition · Human Pose and Action Recognition