Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang

TL;DR
This paper proves that overparameterized neural networks trained with SGD can learn notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, in polynomial time and with polynomially many samples. The analysis goes beyond the NTK linearization used in prior work.
Contribution
It introduces a new notion of quadratic approximation of the neural network, which enables an analysis beyond NTK linearization, and shows that overparameterized networks can learn concept classes given by smaller two- and three-layer networks with smooth activations.
Findings
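Overparameterized networks can learn notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations.
SGD or its variants can train these networks in polynomial time using polynomially many samples.
Sample complexity can be almost independent of the number of parameters in the network.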
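As a rough illustration of the setting in these findings, the sketch below fits a wide ReLU network with plain SGD to data labeled by a much smaller network with a smooth activation. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`make_teacher`, `train_student`), the tanh teacher, the widths, the learning rate, and the choice to train only the hidden layer with a fixed random output layer. It shows training loss decreasing on a toy problem and does not reproduce the paper's guarantees.

```python
import numpy as np

def make_teacher(rng, d=5, k=3):
    """Hypothetical target: a small two-layer network with a smooth (tanh) activation."""
    W_star = rng.normal(size=(k, d)) / np.sqrt(d)
    a_star = rng.normal(size=k) / np.sqrt(k)
    return lambda X: np.tanh(X @ W_star.T) @ a_star

def student_predict(W, a, X):
    """Wide two-layer ReLU network: f(x) = sum_j a_j * relu(<w_j, x>)."""
    return np.maximum(X @ W.T, 0.0) @ a

def half_mse(W, a, X, y):
    return 0.5 * np.mean((student_predict(W, a, X) - y) ** 2)

def train_student(X, y, m=512, lr=0.05, epochs=30, seed=1):
    """Plain SGD on the hidden weights W; the output layer a stays fixed (a simplification)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d)) / np.sqrt(d)          # random initialization
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed random signs
    losses = [half_mse(W, a, X, y)]                   # loss before any update
    for _ in range(epochs):
        for i in rng.permutation(n):
            h = W @ X[i]                              # pre-activations, shape (m,)
            err = np.maximum(h, 0.0) @ a - y[i]       # prediction error on sample i
            # Gradient of 0.5 * err^2 with respect to W: err * (a_j * 1[h_j > 0]) * x
            W -= lr * err * np.outer(a * (h > 0), X[i])
        losses.append(half_mse(W, a, X, y))
    return W, a, losses

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = make_teacher(rng)(X)          # labels from the small smooth-activation network
W, a, losses = train_student(X, y)
print(f"training loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The student has far more hidden units (512) than the teacher (3), which mirrors the overparameterized learner versus smaller target distinction in the findings. Measuring generalization would require evaluating on fresh samples from the same distribution, which this sketch leaves out.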
Abstract
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order…
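For orientation, the contrast between NTK linearization and a second-order view can be written with a generic Taylor expansion of the network output around its random initialization. The notation is assumed here rather than taken from the paper: $f(x; W)$ is the network output, $W_0$ the initialization, and $\Delta$ the weight change, viewed as a flattened vector. This is the textbook expansion and may differ from the paper's exact construction.

```latex
% Generic Taylor expansion around initialization W_0 (assumed notation).
\begin{align*}
f(x; W_0 + \Delta) &\approx f(x; W_0) + \langle \nabla_W f(x; W_0), \Delta \rangle
  && \text{(first order: NTK linearization)} \\
f(x; W_0 + \Delta) &\approx f(x; W_0) + \langle \nabla_W f(x; W_0), \Delta \rangle
  + \tfrac{1}{2}\, \Delta^\top \nabla_W^2 f(x; W_0)\, \Delta
  && \text{(second order: adds a quadratic term)}
\end{align*}
```

In the first line the model is linear in $\Delta$, so learning reduces to a kernel method with fixed features; the second line keeps a term that is quadratic in the weight change.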
Taxonomy
Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Neural Tangent Kernel · Stochastic Gradient Descent
