Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
Ichiro Hashimoto

TL;DR
This paper establishes conditions under which leaky ReLU two-layer neural networks trained with gradient descent exhibit benign overfitting on mixture data, extending previous results by proving directional convergence and identifying phase transitions.
Contribution
It gives the first proof of directional convergence for leaky ReLU networks trained with gradient descent (previously established only for gradient flow), which extends benign overfitting results from nearly orthogonal data to mixture data.
Findings
Benign overfitting occurs in a wider range of scenarios than previously known.
Directional convergence is established for leaky ReLU networks trained with gradient descent.
A new phase transition in classification error is identified.
Abstract
In this paper, we provide sufficient conditions for benign overfitting of fixed-width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. Our results are derived by establishing directional convergence of the network parameters and a classification error bound for the convergent direction. Our classification error bound also reveals a new phase transition. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. Due to the lack of directional convergence, previous results on benign overfitting were limited to networks trained on nearly orthogonal data. All of our results hold on mixture data, which is a broader data setting than the nearly orthogonal data setting in prior work. We demonstrate our findings by showing that benign overfitting occurs with high probability in a…
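The setup described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's experiment or method: it assumes a two-component Gaussian mixture without label noise, logistic loss, a fixed outer layer with entries ±1/√m, and arbitrary values for the width, slope, step size, and dimensions. The names `make_mixture`, `train_gd`, and `predict` are hypothetical helpers introduced only for illustration. The sketch trains the inner-layer weights with full-batch gradient descent and records the cosine between successive normalized weight matrices, a simple proxy for directional convergence.

```python
import numpy as np

def leaky_relu(z, gamma):
    # phi(z) = z for z > 0, gamma * z otherwise
    return np.where(z > 0, z, gamma * z)

def leaky_relu_grad(z, gamma):
    # Derivative of phi; it is discontinuous at z = 0
    return np.where(z > 0, 1.0, gamma)

def make_mixture(n, d, signal, rng):
    # Assumed data model: x = y * mu + standard Gaussian noise, y uniform on {-1, +1}
    y = rng.choice([-1.0, 1.0], size=n)
    mu = np.zeros(d)
    mu[0] = signal
    X = y[:, None] * mu + rng.standard_normal((n, d))
    return X, y

def train_gd(X, y, m=8, gamma=0.1, lr=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((m, d))  # trainable inner layer
    # Fixed outer layer (an assumption): alternating signs, scaled by 1/sqrt(m)
    a = np.where(np.arange(m) % 2 == 0, 1.0, -1.0) / np.sqrt(m)
    prev_dir = W / np.linalg.norm(W)
    cosines = []
    for _ in range(steps):
        Z = X @ W.T                          # pre-activations, shape (n, m)
        margins = y * (leaky_relu(Z, gamma) @ a)
        # Logistic loss log(1 + exp(-q)) has derivative -1 / (1 + exp(q))
        g = -1.0 / (1.0 + np.exp(np.clip(margins, -50.0, 50.0)))
        coef = (g * y)[:, None] * leaky_relu_grad(Z, gamma) * a[None, :]
        grad = coef.T @ X / n                # gradient w.r.t. W, shape (m, d)
        W = W - lr * grad
        cur_dir = W / np.linalg.norm(W)
        # Cosine between successive normalized parameter matrices
        cosines.append(float(np.sum(prev_dir * cur_dir)))
        prev_dir = cur_dir
    return W, a, np.array(cosines)

def predict(X, W, a, gamma=0.1):
    return np.sign(leaky_relu(X @ W.T, gamma) @ a)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = make_mixture(n=40, d=400, signal=4.0, rng=rng)
    W, a, cosines = train_gd(X, y)
    X_test, y_test = make_mixture(n=1000, d=400, signal=4.0, rng=rng)
    print("train accuracy:", np.mean(predict(X, W, a) == y))
    print("test accuracy:", np.mean(predict(X_test, W, a) == y_test))
    print("last successive-direction cosine:", cosines[-1])
```

With the dimension much larger than the sample size, the training set is linearly separable, so the network can fit it exactly. Whether the resulting classifier also generalizes depends on the signal strength and dimensions chosen here, which are not taken from the paper's conditions.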
Peer Reviews
Decision·ICLR 2026 Poster
The paper seems technically sound and gives legitimate improvements over prior work. The first improvement is that the analysis holds beyond the nearly orthogonal setting, which is necessary for understanding the types of data that occur in practice, particularly when the means of the two classes are far apart. They also understand the convergent direction of the weights in a data-dependent way, which may make these results easier to apply as a black box in other settings. In addition to obta
Some of the actual results are perhaps a bit incremental. For example, the convergent direction is precisely that of Frei et al., as should be expected. On the other hand, relaxing the orthogonality assumption seems like a nice direction, as previous work in the area was unable to do this. The actual model is fairly weak compared to modern DL models, and so may be limited in terms of understanding why benign overfitting occurs in practice. Benign overfitting in two-layer networks is an establ
The main result on convergence of the parameters also gives the geometry of the decision boundary (which is linear). The paper goes beyond the nearly orthogonal setting that many previous works assume. It is well-written.
The term "benign overfitting" is used throughout and in Theorem 6.3, but I could not find the mathematical definition of the term. Is interpolation implied by the result on Line 258? It would also be helpful to explicitly write down the Bayes error rate (perhaps it's obvious and I just missed it, but I can't figure it out while reading).
The two key technical contributions of this paper seem to me to be as follows. - Convergence in direction for gradient descent, not just gradient flow. Although this might feel intuitive, I imagine it is not easy to prove: one needs to show some uniform control on the angle between successive gradients, which is complicated by the fact that the gradient angle can jump discontinuously due to the non-smoothness of leaky ReLU. - identifies benign versus harmful overfitting regimes with
The main weaknesses of the paper, I think, are as follows. - Scope / relevance of the setup, namely a shallow leaky ReLU network, only training the inner layer weights, and data which is linearly separable / drawn from a relatively simple distribution. Analysis even in this setting is pretty challenging, though, so I do not think this is a major issue. - A flip side of one of the paper's strengths is that it does not provide finite-sample or high-probability generalization guarantees. The absence of sub-Ga
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and ELM
