SGD Learns the Conjugate Kernel Class of the Network
Amit Daniely

TL;DR
This paper proves that stochastic gradient descent (SGD) can efficiently learn functions within the conjugate kernel space of certain deep neural networks, providing the first polynomial-time guarantees for networks deeper than two layers.
Contribution
It establishes the first polynomial-time learning guarantee for standard SGD on deep networks of more than two layers, connecting neural network training to kernel methods.
Findings
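SGD learns a function competitive with the best function in the conjugate kernel space of the network.
SGD learns constant-degree polynomials with polynomially bounded coefficients in polynomial time.
SGD on large enough networks can learn any continuous function, though not in polynomial time.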
Abstract
We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth more than two. As corollaries, it follows that for neural networks of any depth between 2 and log(n), SGD is guaranteed to learn, in polynomial time, constant degree polynomials with polynomially bounded coefficients. Likewise, it follows that SGD on large enough networks can learn any continuous function (not in polynomial time), complementing classical expressivity results.
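To make "conjugate kernel" more concrete, here is a minimal sketch for the simplest case only: a single hidden layer of ReLU units with standard Gaussian weights, which is an assumption made for illustration and not the paper's general construction. In this setting the kernel induced by the random hidden layer is the average of s(<w, x>) * s(<w, y>) over hidden units, with s = sqrt(2) * ReLU, and for unit-norm inputs it approaches the known degree-1 arc-cosine closed form as the width grows. The function names and the example inputs are hypothetical.

```python
import math
import random


def relu_dual_kernel(rho):
    """Closed form of E[s(u) * s(v)] for s = sqrt(2) * ReLU, where u and v are
    standard Gaussians with correlation rho (degree-1 arc-cosine kernel)."""
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi


def empirical_conjugate_kernel(x, y, width, rng):
    """Monte Carlo estimate: average s(<w, x>) * s(<w, y>) over `width` random
    hidden units with weights w ~ N(0, I). Illustrative assumption only."""
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        u = sum(wi * xi for wi, xi in zip(w, x))
        v = sum(wi * yi for wi, yi in zip(w, y))
        # s(u) * s(v) = 2 * relu(u) * relu(v)
        total += 2.0 * max(u, 0.0) * max(v, 0.0)
    return total / width


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


rng = random.Random(0)          # fixed seed so the run is repeatable
x = normalize([1.0, 2.0, 0.5])  # hypothetical example inputs
y = normalize([0.5, -1.0, 2.0])
rho = sum(a * b for a, b in zip(x, y))

exact = relu_dual_kernel(rho)
approx = empirical_conjugate_kernel(x, y, width=20000, rng=rng)
print(f"closed form: {exact:.4f}  random-feature estimate: {approx:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `width` increases. Deeper architectures compose such kernels layer by layer; that composition is not shown here.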
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Neural Networks and Applications
Methods: Stochastic Gradient Descent
