Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD
Luca Arnaboldi, Florent Krzakala, Bruno Loureiro, Ludovic Stephan

TL;DR
This paper analyzes how two-layer neural networks learn generalized linear models with SGD, showing that overparameterization improves convergence only by a constant factor and that stochasticity plays a minor role in escaping the flat region around initialization.
Contribution
It provides precise sample complexity results for two-layer networks, showing overparameterization's limited impact and the effectiveness of deterministic approximations in analyzing SGD dynamics.
Findings
Overparameterization improves convergence only by a constant factor.
Deterministic approximations effectively model SGD escape times.
Stochasticity has minimal influence on escaping the flat region at initialization.
Abstract
This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario n = O(d log d) samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of…
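The setting in the abstract can be illustrated with a minimal sketch, which is not the authors' code. It assumes a squared-link target y = (w*·x)^2, a width-p student with squared activation and a fixed, uniform second layer, weights renormalized to the unit sphere after each step, and a step size scaled as lr/d. All of these choices, along with the name run_online_sgd and its parameters, are hypothetical and chosen only for illustration. The function records the overlap of each hidden unit with the target direction, which starts near zero (of order 1/sqrt(d)) and is the low-dimensional quantity whose escape from zero one would track.

```python
import numpy as np


def run_online_sgd(d=100, p=4, steps=2000, lr=0.5, seed=0):
    """One-pass SGD for a two-layer student on a squared-link target.

    Illustrative assumptions (not taken from the paper): squared activation,
    fixed uniform second layer, spherical weight normalization, step lr / d.
    Returns an array of shape (steps + 1, p) with the overlaps w_j . w_star.
    """
    rng = np.random.default_rng(seed)

    # Hidden target direction on the unit sphere.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)

    # Random student weights on the sphere: overlaps start at O(1 / sqrt(d)).
    W = rng.standard_normal((p, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    overlaps = np.empty((steps + 1, p))
    overlaps[0] = W @ w_star

    for t in range(steps):
        # A fresh Gaussian sample at every step (one-pass / online SGD).
        x = rng.standard_normal(d)
        y = (w_star @ x) ** 2

        pre = W @ x                      # pre-activations, shape (p,)
        err = y - np.mean(pre ** 2)      # residual of the student output

        # Gradient of 0.5 * err**2 with respect to each row w_j.
        grad = -err * (2.0 / p) * pre[:, None] * x[None, :]

        W -= (lr / d) * grad
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # project back to sphere

        overlaps[t + 1] = W @ w_star

    return overlaps


if __name__ == "__main__":
    hist = run_online_sgd()
    print("initial max |overlap|:", np.abs(hist[0]).max())
    print("final max |overlap|:  ", np.abs(hist[-1]).max())
```

Rerunning with different values of p or seed gives a rough empirical view of how the time spent near zero overlap depends on width and on the noise realization. Under these assumptions it is only a qualitative illustration, not a reproduction of the paper's results.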
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Neural Networks and Applications
Methods: Stochastic Gradient Descent
