Learning Hierarchical Polynomials with Three-Layer Neural Networks
Zihao Wang, Eshaan Nichani, Jason D. Lee

TL;DR
This paper proves that three-layer neural networks can efficiently learn hierarchical polynomial functions with polynomial sample complexity, outperforming kernel methods and previous neural network guarantees, especially for quadratic target functions.
Contribution
It establishes the first polynomial-time, polynomial-sample guarantees for learning hierarchical polynomials with three-layer neural networks, extending prior quadratic-specific results.
Findings
Neural networks learn hierarchical polynomials with $\tilde{O}(d^k)$ samples.
Three-layer networks outperform kernel methods for these functions.
Optimal sample complexity $\tilde{O}(d^2)$ achieved for quadratic polynomials.
Abstract
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$, where $p : \mathbb{R}^d \to \mathbb{R}$ is a degree $k$ polynomial and $g : \mathbb{R} \to \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k = 1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\tilde{O}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\tilde{\Theta}(d^{kq})$ samples, as…
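To make the target class concrete, here is a minimal NumPy sketch of one hierarchical polynomial of the form $g \circ p$ evaluated on standard Gaussian inputs. The specific choices are assumptions made for illustration and are not taken from the paper: the inner polynomial is a trace-free quadratic form $p(x) = x^\top A x$ (so $k = 2$), the outer link is the cubic $g(z) = z^3 - 3z$, and the helper names `make_target` and `sample` are hypothetical. The sketch only generates data; it does not implement the paper's training procedure or check its assumptions on $p$.

```python
import numpy as np


def make_target(d, seed=0):
    """Build an illustrative hierarchical polynomial h = g(p(x)) on R^d."""
    rng = np.random.default_rng(seed)
    # Hypothetical inner feature p(x) = x^T A x (degree k = 2).
    A = rng.standard_normal((d, d))
    A = (A + A.T) / 2.0                    # symmetrize
    A -= (np.trace(A) / d) * np.eye(d)     # zero trace, so E[p(x)] = 0 for x ~ N(0, I)
    A /= np.sqrt(2.0) * np.linalg.norm(A)  # Var[p(x)] = 2 * ||A||_F^2 = 1

    def p(X):
        # Row-wise quadratic form x^T A x for a batch X of shape (n, d).
        return np.einsum("ni,ij,nj->n", X, A, X)

    def g(z):
        # Hypothetical outer link of degree q = 3 (the Hermite polynomial He_3).
        return z**3 - 3.0 * z

    def h(X):
        # The learner only ever sees h(x), never the hidden feature p(x).
        return g(p(X))

    return A, p, g, h


def sample(h, n, d, seed=1):
    """Draw n standard Gaussian inputs and noiseless labels y = h(x)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    return X, h(X)


A, p, g, h = make_target(d=10)
X, y = sample(h, n=20000, d=10)
print(X.shape, y.shape)
```

Normalizing $A$ to zero trace and Frobenius norm $1/\sqrt{2}$ gives the hidden feature $p(x)$ mean zero and unit variance under the Gaussian input distribution, which keeps the scale of the labels independent of $d$ in this toy setup.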
Peer Reviews
Decision·ICLR 2024 poster
- The paper considers the problem of understanding the benefits of feature learning for multi-layer neural networks. The one-layer case has attracted a lot of attention, while the multi-layer case is so far much less understood.
- They are able to show a large separation with kernel methods. The reason is that the first layer is able to extract a good representation of the data (the polynomial $p(x)$) from only seeing samples $g \circ p (x)$.
- This class of target functions naturally generalizes the sing
- The architecture and algorithm are chosen specifically to succeed for this specific class of target functions (composition of degree-$k$ multivariate polynomials with univariate functions). Several previous works have considered such layer-wise training on non-regular architectures for specific hierarchical classes of functions, including [Allen-Zhu,Li,2020] and [The staircase property, Abbe, Boix, Brennan, Bresler, Nagaraj, 2020]. It is unclear how this paper contributes in terms of novel ide
The paper is pleasant to read and the mathematical results are supported with extensive discussion.
The main weakness of the paper is the close relationship with previous works. Although a fair comparison is given, the works by [Allen-Zhu & Li, 2019; 2020] and [Nichani et al., 2023] contain many of the key ideas in the manuscript.
The paper's main result is that for a subclass of degree $k$ polynomials $p$ and standard Gaussian marginals, a 3-layer NN trained via layerwise GD on the $L_2$ loss learns the target hierarchical polynomial (realizable setting) with roughly $d^k$ samples and runtime that is polynomial in the parameters of the problem. I think that the result is interesting and fits well with the ICLR community. In general, the paper is easy to read: the assumptions are presented in a clear manner, comparison w
I believe that it would be beneficial if the authors mentioned families of polynomials not captured by Assumption 4. This would make clearer how strong and restrictive this assumption is. The assumption greatly simplifies the analysis, so it would be nice if the authors could discuss it further (I see why the families of Remark 3 satisfy this condition, but I think a further discussion of how this assumption simplifies the analysis would be helpful).
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning in Healthcare · Machine Learning and Data Classification
