Is Stochastic Gradient Descent Near Optimal?
Yifan Zhu (1), Hong Jun Jeon (1), Benjamin Van Roy (1) ((1) Stanford University, Department of Electrical Engineering)

TL;DR
This paper shows that stochastic gradient descent (SGD) can efficiently learn single-hidden-layer ReLU neural networks with near-optimal sample complexity, bridging the gap between statistical optimality and computational feasibility.
Contribution
The paper demonstrates that SGD with automated width selection nearly achieves information-theoretic sample complexity bounds for learning ReLU networks, despite computational intractability in worst-case scenarios.
Findings
SGD attains small expected error with a number of samples nearly linear in the input dimension and the width.
These empirical results contrast with worst-case theory, under which the computation needed to reach this sample complexity is intractable.
SGD's efficiency aligns with statistical optimality bounds.
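The teacher-student setup behind these findings can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the Gaussian weight scaling, learning rate, batch size, and all function names (`make_teacher`, `sgd_fit`, `select_width`) are assumptions made for the example. In particular, the paper's width-selection procedure is not described in this summary, so a simple validation-error search over candidate widths stands in for it.

```python
import numpy as np

def make_teacher(d, k, rng):
    """Sample a single-hidden-layer ReLU teacher (Gaussian scaling is an assumption)."""
    W = rng.standard_normal((k, d)) / np.sqrt(d)
    a = rng.standard_normal(k) / np.sqrt(k)
    return W, a

def forward(W, a, X):
    """Compute f(x) = a^T relu(W x) for every row x of X."""
    return np.maximum(X @ W.T, 0.0) @ a

def mse(W, a, X, y):
    return float(np.mean((forward(W, a, X) - y) ** 2))

def sgd_fit(X, y, width, lr=0.05, epochs=50, batch=32, seed=0):
    """Fit a student of the given width with minibatch SGD on squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            Xb, yb = X[idx], y[idx]
            pre = Xb @ W.T                      # pre-activations, shape (b, width)
            h = np.maximum(pre, 0.0)            # hidden ReLU outputs
            err = h @ a - yb                    # residuals, shape (b,)
            # Gradients of 0.5 * mean(err^2) with respect to a and W.
            grad_a = h.T @ err / len(idx)
            grad_W = ((err[:, None] * a[None, :]) * (pre > 0)).T @ Xb / len(idx)
            a -= lr * grad_a
            W -= lr * grad_W
    return W, a

def select_width(X_tr, y_tr, X_val, y_val, widths):
    """Stand-in for automated width selection: keep the width with lowest validation MSE."""
    best = None
    for w in widths:
        W, a = sgd_fit(X_tr, y_tr, w)
        val = mse(W, a, X_val, y_val)
        if best is None or val < best[0]:
            best = (val, w, W, a)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_t, a_t = make_teacher(d=5, k=3, rng=rng)
    X = rng.standard_normal((600, 5))
    y = forward(W_t, a_t, X)                    # noiseless teacher labels
    val, width, _, _ = select_width(X[:400], y[:400], X[400:], y[400:], [2, 4, 8])
    print("selected width:", width, "validation MSE:", val)
```

Holding out a validation split keeps the width choice from simply favoring the widest student, which would always fit the training data at least as well.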
Abstract
The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
Methods: Stochastic Gradient Descent
