Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network?
Dan Qiao, Yu-Xiang Wang

TL;DR
This paper investigates whether flatness of solutions guarantees generalization in univariate two-layer ReLU networks with logistic loss, revealing both positive bounds and limitations of flatness as a predictor.
Contribution
It provides the first analysis of flatness and generalization under logistic loss, showing conditions where flatness implies good generalization and cases where it does not.
Findings
Flat solutions enjoy near-optimal generalization bounds within the region between the left-most and right-most uncertain sets determined by each candidate solution.
Arbitrarily flat overfitting solutions exist at infinity, and they are falsely certain.
Flatness alone is insufficient to guarantee generalization in logistic loss scenarios.
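The second finding can be made concrete with a small numerical sketch. This is not the paper's construction: the four data points, the two-neuron network, and the helper names `sigmoid` and `logistic_hessian` are all hypothetical choices made for illustration. It assumes flatness is measured by the largest eigenvalue of the Hessian of the training loss, and it only shows the general mechanism: once a network separates the training data, scaling up its output weights drives the logistic-loss curvature toward zero.

```python
import numpy as np

def sigmoid(t):
    # Stable logistic function: exp(-log(1 + exp(-t))).
    return np.exp(-np.logaddexp(0.0, -t))

def logistic_hessian(a, w, b, x, y):
    """Hessian of the mean logistic loss of f(x) = sum_j a_j * relu(w_j x + b_j).

    Parameters are ordered (a, w, b). Assumes no data point sits exactly
    on a ReLU kink, so f is twice differentiable at these parameters.
    """
    m, n = a.size, x.size
    pre = np.outer(x, w) + b            # (n, m) pre-activations
    act = (pre > 0).astype(float)       # ReLU derivative
    f = np.maximum(pre, 0.0) @ a        # (n,) network outputs
    z = y * f                           # margins y_i * f(x_i)
    d1 = -sigmoid(-z)                   # l'(z)  for l(z) = log(1 + exp(-z))
    d2 = sigmoid(z) * sigmoid(-z)       # l''(z)
    H = np.zeros((3 * m, 3 * m))
    for i in range(n):
        # Gradient of f(x_i) with respect to (a, w, b).
        g = np.concatenate([np.maximum(pre[i], 0.0),
                            a * act[i] * x[i],
                            a * act[i]])
        # Hessian of f(x_i): only the (a_j, w_j) and (a_j, b_j) cross terms are nonzero.
        G = np.zeros((3 * m, 3 * m))
        for j in range(m):
            G[j, m + j] = G[m + j, j] = act[i, j] * x[i]
            G[j, 2 * m + j] = G[2 * m + j, j] = act[i, j]
        H += d2[i] * np.outer(g, g) + d1[i] * y[i] * G
    return H / n, z

# Hypothetical separable data and a network with f(x) = c * x.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
a0 = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])
b = np.zeros(2)

results = []
for c in [1.0, 4.0, 16.0]:
    H, z = logistic_hessian(c * a0, w, b, x, y)
    lam_max = np.linalg.eigvalsh(H)[-1]   # eigvalsh returns ascending eigenvalues
    results.append((c, z.min(), lam_max))
    print(f"scale={c:5.1f}  min margin={z.min():6.2f}  lambda_max={lam_max:.3e}")
```

Every scale classifies the training points correctly, yet the largest Hessian eigenvalue shrinks as the scale grows. Flatness measured this way therefore cannot, by itself, distinguish a well-calibrated predictor from an overconfident one.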
Abstract
We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU neural networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat/stable minima and the Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that flatness-implied generalization is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most uncertain sets determined by each candidate solution. On the negative…
Peer Reviews
Decision·Submitted to ICLR 2026
In Theorem 3.1, the authors prove an upper bound on the weighted TV norm of a function (mapping input data to output predictions) in terms of the maximum eigenvalue of the loss Hessian (which is a function of the parameters). This result is an interesting link between two fundamentally different types of object. In Theorem 3.2, the authors construct an example function whose maximum loss-Hessian eigenvalue goes to zero, and therefore the function is flat (measured by TV norm). However, the fu
The paper is very symbol-heavy. For instance, the weight hγ in the TV norm in Theorem 3.1 is difficult to interpret, to the point that I can't really tell whether the left-hand side of inequality (5) is truly a measure of flatness.
This paper addresses an interesting topic—how dynamical stability, or the curvature of the loss landscape, influences the learned predictor in classification tasks. The authors present results from several perspectives, including bounds on the total variation of the predictor’s output, excess risk, and generalization performance. The results are clearly formulated, and the assumptions are well-specified. Additionally, the theoretical findings are supported by experiments.
I believe the main weakness of the paper lies in the lack of interpretation of its results. Unlike in the regression setting, where previous findings were relatively straightforward to interpret, several conclusions in this work remain unclear or insufficiently discussed (see questions below). Moreover, the theory of dynamical stability referenced in the paper applies to local minima, which are equilibria of the gradient descent mapping. However, as the authors themselves note, in the overparam
The paper furthers work on the question of whether flatness implies generalization by investigating logistic loss. The authors provide a concrete example where flatness is not sufficient for generalization, which is in stark contrast to previous work with square loss in which flatness was sufficient. A new analysis technique involving the uncertain sets, as opposed to considering the entire domain, is developed to understand the conditions under which flatness implies generalization in this clas
The presentation is relatively dense and challenging to follow for a reader who is not a specialist in the area. The new ideas in the paper should have been presented more clearly and discussed further. Similarly, the significance of the main results could have been better justified and explained. Theorem 3.3 demonstrates that the generalisation gap of $f_\theta$ is small when $\theta$ is small; however, this is rarely mentioned further in the paper. It is finally compared with Theorem 3.7
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
