Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK
Yuanzhi Li, Tengyu Ma, Hongyang R. Zhang

TL;DR
This paper demonstrates that over-parameterized two-layer ReLU neural networks trained with gradient descent can efficiently learn certain functions beyond the capabilities of kernel methods like NTK, with provable guarantees.
Contribution
It provides the first theoretical proof that over-parameterized neural networks can learn beyond the NTK regime in polynomial time with polynomial samples.
Findings
Neural networks achieve population loss at most o(1/d).
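Any kernel method, including NTK, with polynomially many samples in d has population loss at least Ω(1/d).
Gradient descent from random initialization learns the target function in polynomial time with polynomial samples.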
Abstract
We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input x ∈ R^d is drawn from a Gaussian distribution and the label of x satisfies f*(x) = a^T |W* x|, where a ∈ R^d is a nonnegative vector and W* ∈ R^{d×d} is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most o(1/d) in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in d, has population loss at least Ω(1/d).
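As a minimal sketch of the data model in the abstract, assuming the target has the form f*(x) = a^T |W* x| with nonnegative a and orthonormal W*, the snippet below samples Gaussian inputs and checks that this target is exactly a two-layer ReLU network with 2d hidden units, using |z| = relu(z) + relu(-z). The helper names (`random_orthonormal`, `target`, `relu_network`), the dimension d = 5, and the sample count are illustrative assumptions. This is not the paper's training procedure or analysis.

```python
import math
import random


def random_orthonormal(d, rng):
    """Return d orthonormal rows via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:  # skip (unlikely) degenerate draws
            rows.append([vi / norm for vi in v])
    return rows


def target(x, a, W):
    """Assumed ground truth: f*(x) = a^T |W x|."""
    return sum(
        ai * abs(sum(wij * xj for wij, xj in zip(wi, x)))
        for ai, wi in zip(a, W)
    )


def relu(z):
    return max(z, 0.0)


def relu_network(x, a, W):
    """Two-layer ReLU net with 2d hidden units: weights +w_i and -w_i."""
    out = 0.0
    for ai, wi in zip(a, W):
        z = sum(wij * xj for wij, xj in zip(wi, x))
        out += ai * (relu(z) + relu(-z))
    return out


rng = random.Random(0)  # fixed seed so the run is reproducible
d = 5                   # illustrative input dimension
W_star = random_orthonormal(d, rng)
a = [rng.random() for _ in range(d)]  # nonnegative output weights
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(100)]

# Largest disagreement between the target and its ReLU representation.
max_gap = max(
    abs(target(x, a, W_star) - relu_network(x, a, W_star)) for x in xs
)
```

Here `max_gap` should be zero up to floating-point error, which shows the target lies in the function class of the network being trained. The question the paper studies is whether gradient descent finds it from random initialization.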
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
