Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of Neural Networks with Polynomial Width, Samples, and Time
Arvind Mahankali, Jeff Z. Haochen, Kefan Dong, Margalit Glasgow, Tengyu Ma

TL;DR
This paper demonstrates that vanilla gradient descent on polynomial-width two-layer neural networks can outperform kernel methods in terms of sample complexity, using a mean-field analysis without unnatural modifications.
Contribution
It provides a mean-field analysis showing unmodified gradient descent achieves better sample complexity than kernel methods, with polynomial convergence guarantees.
Findings
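Projected gradient flow converges in polynomial time with $n=O(d^{3.1})$ samples, where $d$ is the input dimension
The trained network reaches a non-trivial error that kernel methods cannot achieve with $n \ll d^4$ samples
Projected gradient descent converges to low error in a polynomial number of iterations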
Abstract
Despite recent theoretical progress on the non-convex optimization of two-layer neural networks, it is still an open question whether gradient descent on neural networks without unnatural modifications can achieve better sample complexity than kernel methods. This paper provides a clean mean-field analysis of projected gradient flow on polynomial-width two-layer neural networks. Different from prior works, our analysis does not require unnatural modifications of the optimization algorithm. We prove that with sample size $n = O(d^{3.1})$, where $d$ is the dimension of the inputs, the network trained with projected gradient flow converges in $\mathrm{poly}(d)$ time to a non-trivial error that is not achievable by kernel methods using $n \ll d^4$ samples, hence demonstrating a clear separation between unmodified gradient descent and NTK. As a corollary, we show that projected gradient descent…
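To make the phrase "projected gradient descent on a two-layer network" concrete, here is a minimal sketch in plain Python. Everything specific in it is an illustrative assumption rather than the paper's setup: the constraint set (first-layer weights kept on the unit sphere), the mean-field $1/m$ output scaling, the quadratic activation, the single-direction target, the data distribution, and the step size. The helper names (`projected_gd`, `predict`, and so on) are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Projection onto the unit sphere (assumed constraint set).
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def random_unit_vector(d, rng):
    return normalize([rng.gauss(0.0, 1.0) for _ in range(d)])


def activation(z):
    # Hypothetical activation chosen only for illustration.
    return z * z


def activation_grad(z):
    return 2.0 * z


def predict(ws, x):
    # Mean-field parameterization: average over the m neurons.
    return sum(activation(dot(w, x)) for w in ws) / len(ws)


def projected_gd(xs, ys, m, steps, lr, seed=0):
    """Gradient step on the squared loss, then renormalize each neuron."""
    rng = random.Random(seed)
    d, n = len(xs[0]), len(xs)
    ws = [random_unit_vector(d, rng) for _ in range(m)]
    losses = []
    for _ in range(steps):
        residuals = [predict(ws, x) - y for x, y in zip(xs, ys)]
        losses.append(0.5 * sum(r * r for r in residuals) / n)
        new_ws = []
        for w in ws:
            # Per-neuron gradient; the 1/m factor is absorbed into lr.
            grad = [0.0] * d
            for x, r in zip(xs, residuals):
                coeff = r * activation_grad(dot(w, x)) / n
                for k in range(d):
                    grad[k] += coeff * x[k]
            stepped = [wk - lr * gk for wk, gk in zip(w, grad)]
            new_ws.append(normalize(stepped))  # projection step
        ws = new_ws
    return ws, losses


# Toy data: inputs on the unit sphere, labels from one hidden direction q.
rng = random.Random(1)
d, n = 4, 50
q = random_unit_vector(d, rng)
xs = [random_unit_vector(d, rng) for _ in range(n)]
ys = [activation(dot(q, x)) for x in xs]

ws, losses = projected_gd(xs, ys, m=10, steps=20, lr=0.5)
print("first loss:", losses[0], "last loss:", losses[-1])
```

The only difference from ordinary gradient descent is the `normalize` call after each step, which keeps every neuron in the constraint set. The sketch makes no claim about reproducing the paper's sample-complexity or convergence guarantees.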
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Model Reduction and Neural Networks
Methods: Neural Tangent Kernel
