Benignity of loss landscape with weight decay requires both large overparametrization and initialization

Etienne Boursier; Matthew Bowditch; Matthias Englert; Ranko Lazic

arXiv:2505.22578·cs.LG·May 29, 2025

Open Access

TL;DR

This paper shows that, for two-layer ReLU networks trained with weight decay, a benign loss landscape (one free of spurious local minima) requires both large overparametrization and large initialization.

Contribution

A theoretical analysis of the ℓ2-regularized training loss of two-layer ReLU networks that identifies when the landscape is benign: the network width must be large relative to the data, and the initialization must be large.

Findings

01

The landscape is benign under large overparametrization, namely when the width m satisfies m ≳ min(n^d, 2^n), where n is the number of data points and d the input dimension.

02

In this regime, almost all constant activation regions contain a global minimum and no spurious local minima.

03

With small initialization, training can still end in a spurious local minimum; the benign-landscape results are relevant at large initialization.
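
To make the quantities in the findings concrete, here is a minimal Python sketch. It is illustrative only and not taken from the paper: the function names `width_threshold` and `activation_pattern` are hypothetical, and the threshold drops whatever constants or lower-order factors the ≳ sign hides. A constant activation region is understood here as a set of hidden-layer weights on which the on/off pattern of every neuron on every data point stays fixed.

```python
import numpy as np


def width_threshold(n: int, d: int) -> int:
    """Return min(n^d, 2^n), ignoring the constants hidden by the ≳ sign (assumption)."""
    return min(n ** d, 2 ** n)


def activation_pattern(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Boolean (n, m) matrix: entry (i, j) is True when neuron j is active on point i.

    W has shape (m, d) (one row per hidden neuron), X has shape (n, d).
    Two weight matrices with the same pattern lie in the same activation region.
    """
    return (X @ W.T) > 0


# With n = 10 points in dimension d = 3, the polynomial term n^d is the smaller one.
print(width_threshold(n=10, d=3))
```

For small n the exponential term 2^n dominates the minimum, and for small d the polynomial term n^d does, which is why the bound is stated as a minimum of the two.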

Abstract

The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign -- i.e., free of spurious local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary via the example of orthogonal data. Finally, we…
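
As a rough illustration of the object the abstract studies, the following Python sketch evaluates an ℓ2-regularized training loss for a two-layer ReLU network. The details are assumptions, not the paper's definitions: the squared loss, the 1/2 factors, the absence of bias terms, and the names `regularized_loss`, `W`, `a`, and `lam` are all chosen here for illustration.

```python
import numpy as np


def regularized_loss(W, a, X, y, lam):
    """Squared loss plus weight decay for f(x) = sum_j a_j * relu(<w_j, x>).

    W: (m, d) hidden weights, a: (m,) output weights,
    X: (n, d) inputs, y: (n,) targets, lam: weight decay strength.
    The loss form and scaling are assumptions made for this sketch.
    """
    hidden = np.maximum(X @ W.T, 0.0)            # ReLU features, shape (n, m)
    preds = hidden @ a                           # network outputs, shape (n,)
    data_fit = 0.5 * np.mean((preds - y) ** 2)   # unregularized training loss
    penalty = 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))  # l2 penalty on all weights
    return data_fit + penalty


# A width-1 network that fits both points exactly, so only the penalty remains.
W = np.array([[1.0, 0.0]])
a = np.array([2.0])
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([2.0, 0.0])
print(regularized_loss(W, a, X, y, lam=0.1))
```

Even at zero data-fit error the penalty term is nonzero, so a minimizer of the regularized loss must trade off fit against weight norm; this is what distinguishes the regularized landscape from the unregularized one.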

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Model Reduction and Neural Networks