Noisy Interpolation Learning with Shallow Univariate ReLU Networks

Nirmit Joshi; Gal Vardi; Nathan Srebro

arXiv:2307.15396 · cs.LG · March 25, 2024

Open Access · 3 Reviews

TL;DR

This paper rigorously analyzes how shallow univariate ReLU networks overfit noisy data, revealing tempered overfitting with respect to L1 loss but catastrophic overfitting with L2 loss.

Contribution

It provides the first rigorous analysis of overfitting in minimum norm regression with shallow ReLU networks, highlighting different behaviors under L1 and L2 losses.
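
As a rough sketch of what "minimum norm regression" refers to here, the interpolation problem can be written as follows. The notation ($\theta$, $a_j$, $w_j$, $b_j$, width $m$) is assumed for illustration and is not taken from the paper. The exact parametrization the paper uses, for instance whether biases are penalized or a linear skip term is included, is not stated in this summary and is left open.

```latex
% Hypothetical notation: a univariate two-layer ReLU network of width m.
f_\theta(x) = \sum_{j=1}^{m} a_j \, \max(0,\; w_j x + b_j)

% Minimum-norm interpolation of n noisy samples (x_i, y_i):
\hat{\theta} \in \arg\min_{\theta} \; \|\theta\|_2^2
\quad \text{subject to} \quad f_\theta(x_i) = y_i \quad \text{for all } i = 1, \dots, n
```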

Findings

01. Overfitting is tempered with respect to L1 loss.

02. Overfitting is catastrophic with respect to L2 loss.

03. Behavior differs when averaging over training sets.
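
The L1 and L2 risks contrasted in these findings can be estimated empirically for any interpolator. The sketch below is a minimal illustration under stated assumptions: it uses a linear-spline interpolator (via `np.interp`) as a stand-in, not the minimum-norm ReLU network analyzed in the paper, and it assumes a zero target function with standard Gaussian label noise on inputs uniform in [0, 1]. The helper names `linear_spline_interpolant` and `estimate_lp_risk` are hypothetical. It shows how the two risks are measured and does not reproduce the paper's results.

```python
import numpy as np


def linear_spline_interpolant(x_train, y_train):
    """Return a function that linearly interpolates the training points.

    Outside the range of x_train, np.interp holds the endpoint values constant.
    """
    order = np.argsort(x_train)
    xs, ys = x_train[order], y_train[order]
    return lambda x: np.interp(x, xs, ys)


def estimate_lp_risk(predict, p, noise_std=1.0, n_test=20000, seed=1):
    """Monte Carlo estimate of E|f(x) - y|^p on fresh noisy samples.

    Assumption: the target is f*(x) = 0, so labels are pure Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    x_test = rng.uniform(0.0, 1.0, size=n_test)
    y_test = rng.normal(0.0, noise_std, size=n_test)
    return float(np.mean(np.abs(predict(x_test) - y_test) ** p))


# Draw a noisy training set and fit the interpolator.
rng = np.random.default_rng(0)
n = 200
x_train = rng.uniform(0.0, 1.0, size=n)
y_train = rng.normal(0.0, 1.0, size=n)
f_hat = linear_spline_interpolant(x_train, y_train)

# Estimate the population risk under both losses.
risk_l1 = estimate_lp_risk(f_hat, p=1)
risk_l2 = estimate_lp_risk(f_hat, p=2)
print(f"estimated L1 risk: {risk_l1:.3f}")
print(f"estimated L2 risk: {risk_l2:.3f}")
```

Rerunning with larger `n` and a different interpolator would be the way to probe how each risk behaves as the sample size grows.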

Abstract

Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, this has not been studied rigorously. We provide the first rigorous analysis of the overfitting behavior of regression with minimum norm ($\ell_{2}$ of weights), focusing on univariate two-layer ReLU networks. We show overfitting is tempered (with high probability) when measured with respect to the $L_{1}$ loss, but also show that the situation is more complex than suggested by Mallinar et al., and overfitting is catastrophic with respect to the $L_{2}$ loss, or when taking an expectation over the…
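
The regimes named in the abstract can be written as limits of the population risk. The following is a sketch in assumed notation, not the paper's: $\hat{f}_n$ is the interpolator learned from $n$ samples, $\mathcal{R}_p$ is the $L_p$ population risk, and $\mathcal{R}_p^{*}$ is the Bayes optimal risk. It also assumes the limit exists.

```latex
\mathcal{R}_p(f) = \mathbb{E}_{(x,y)}\big[\, |f(x) - y|^p \,\big],
\qquad
\mathcal{R}_p^{*} = \inf_{f} \mathcal{R}_p(f)

\begin{aligned}
\text{benign:}       &\quad \lim_{n \to \infty} \mathcal{R}_p(\hat{f}_n) = \mathcal{R}_p^{*} \\
\text{tempered:}     &\quad \mathcal{R}_p^{*} < \lim_{n \to \infty} \mathcal{R}_p(\hat{f}_n) < \infty \\
\text{catastrophic:} &\quad \lim_{n \to \infty} \mathcal{R}_p(\hat{f}_n) = \infty
\end{aligned}
```

In this notation, the abstract's claim is that the minimum-norm interpolator falls in the tempered regime for $p = 1$ (with high probability) and in the catastrophic regime for $p = 2$.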

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- Understanding the generalization performance of interpolating solutions, especially non-linear interpolators such as ReLU networks, is an important and interesting question in deep learning theory.
- The paper gives a detailed characterization of the min-$\ell_2$-norm interpolating ReLU networks, compares them with linear splines, and shows the subtle generalization performance depending on the loss function. I believe this is a good result and it seems to be novel in the literature of impli

Weaknesses

- The paper focuses on univariate ReLU networks, which is a relatively simple setting (though this is understandable from a technical point of view).
- The results hold in the asymptotic regime where the sample size $n$ goes to infinity. Thus, the paper does not give an explicit rate of convergence.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

This paper is exceptionally well-crafted, boasting a highly organized structure that enhances its clarity and readability. The main paper is thoughtfully structured, and the presentation of the proof concept is remarkably accessible, thanks in part to the informative graphs provided. The theoretical framework is excellent, as this paper conducts a comprehensive examination of the overfitting tendencies observed in min-norm ReLU networks within the context of regression. In doing so, it effectiv

Weaknesses

It appears that this paper can be regarded as a subsequent work to Boursier & Flammarion (2023). The connection is evident as the pivotal lemma (Lemma 2.1) employed in this paper is directly drawn from Boursier & Flammarion (2023). Furthermore, the neural network model studied in this paper aligns with the one extensively examined in Boursier & Flammarion (2023). Consequently, the technical innovation in this paper seems somewhat limited in this regard. It's important to note that this paper ad

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

The generalization performance of linear splines and of the min-norm solution is studied in terms of tempered and catastrophic overfitting.

Weaknesses

There is no distinct drawback in my view. The proof is based on the nice statistical property of $\ell_i$, which might be the key technical difficulty when extending to $d$-dimensional data. For example, for two-dimensional data, one would sort the data, split the space, and define the risk on a two-dimensional interval, but the estimation based on $\ell_i$ is unclear to me. Besides, the comparison with Kornowski et al. (2023) requires more discussion, especially in terms of the technical tools. Apart

Taxonomy

Topics: Machine Learning and Data Classification · Neural Networks and Applications · Gaussian Processes and Bayesian Inference