Is Stochastic Gradient Descent Near Optimal?

Yifan Zhu (1), Hong Jun Jeon (1), Benjamin Van Roy (1) ((1) Stanford University, Department of Electrical Engineering)

arXiv:2209.08627 · cs.LG · March 28, 2023

TL;DR

This paper shows that stochastic gradient descent (SGD) can efficiently learn single-hidden-layer ReLU neural networks with near-optimal sample complexity, bridging the gap between statistical optimality and computational feasibility.

Contribution

The paper demonstrates that SGD with automated width selection nearly achieves information-theoretic sample complexity bounds for learning ReLU networks, despite computational intractability in worst-case scenarios.

Findings

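01. SGD attains small expected error with a number of samples nearly linear in the input dimension and width.

02. The empirical results contrast with worst-case computational theory, which suggests that attaining this sample complexity for all teacher networks is intractable.

03. SGD's sample efficiency nearly matches the information-theoretic bound for an optimal learner.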

Abstract

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient…
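The teacher-student setup described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's experimental code: the Gaussian weight distribution, the fixed student width, the learning rate, and the names `init_params`, `forward`, `mse`, and `sgd_fit` are all assumptions made for illustration, and the automated width selection mentioned in the summary is not implemented here.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def init_params(rng, d, width):
    # Gaussian weights scaled by fan-in. This is an illustrative assumption,
    # not the "natural distribution" used in the paper.
    W = rng.normal(size=(width, d)) / np.sqrt(d)
    b = np.zeros(width)
    a = rng.normal(size=width) / np.sqrt(width)
    return [W, b, a]


def forward(params, x):
    # Single-hidden-layer ReLU network: x -> a^T relu(W x + b).
    W, b, a = params
    return relu(x @ W.T + b) @ a


def mse(params, x, y):
    return float(np.mean((forward(params, x) - y) ** 2))


def sgd_fit(rng, x, y, width, lr=0.05, epochs=100, batch=32):
    # Minibatch SGD on the squared loss 0.5 * mean((f(x) - y)^2).
    n, d = x.shape
    W, b, a = init_params(rng, d, width)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            xb, yb = x[idx], y[idx]
            pre = xb @ W.T + b                       # (B, width)
            h = relu(pre)
            err = h @ a - yb                         # (B,)
            # Backpropagate through the ReLU layer.
            dpre = err[:, None] * a[None, :] * (pre > 0)
            grad_a = h.T @ err / len(idx)
            grad_W = dpre.T @ xb / len(idx)
            grad_b = dpre.mean(axis=0)
            W = W - lr * grad_W
            b = b - lr * grad_b
            a = a - lr * grad_a
    return [W, b, a]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, teacher_width = 5, 4
    teacher = init_params(rng, d, teacher_width)     # hypothetical teacher
    x_train = rng.normal(size=(512, d))
    x_test = rng.normal(size=(512, d))
    y_train, y_test = forward(teacher, x_train), forward(teacher, x_test)
    # Student width fixed at twice the teacher width (an arbitrary choice).
    student = sgd_fit(rng, x_train, y_train, width=2 * teacher_width)
    print("train MSE:", mse(student, x_train, y_train))
    print("test MSE:", mse(student, x_test, y_test))
```

Varying the number of training samples for several values of `d` and the teacher width, and recording how many samples are needed to reach a target test error, is one way to probe the near-linear scaling that the summary describes.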

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM

Methods: Stochastic Gradient Descent