Learning Hierarchical Polynomials with Three-Layer Neural Networks

Zihao Wang; Eshaan Nichani; Jason D. Lee

arXiv:2311.13774 · cs.LG · November 27, 2023 · 1 citation

Open Access · 3 Reviews

TL;DR

This paper proves that three-layer neural networks can efficiently learn hierarchical polynomial functions with polynomial sample complexity, outperforming kernel methods and previous neural network guarantees, especially for quadratic target functions.

Contribution

It establishes the first polynomial-time, polynomial-sample guarantees for learning hierarchical polynomials with three-layer neural networks, extending prior quadratic-specific results.

Findings

01

Neural networks learn hierarchical polynomials with $\tilde{O}(d^k)$ samples.

02

Three-layer networks outperform kernel methods for these functions.

03

Optimal sample complexity $\tilde{O}(d^2)$ achieved for quadratic polynomials.

Abstract

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$, where $p : \mathbb{R}^{d} \to \mathbb{R}$ is a degree $k$ polynomial and $g : \mathbb{R} \to \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k = 1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\tilde{O}(d^{k})$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\Theta(d^{kq})$ samples, as…
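To make the target class in the abstract concrete, here is a minimal Python sketch that builds one hierarchical polynomial $h = g \circ p$ and compares the two sample-size scalings quoted above. Everything specific in it is an assumption chosen for illustration rather than taken from the paper: the quadratic inner polynomial $p(x) = x^\top A x$ (so $k = 2$), the cubic outer polynomial $g(z) = z^3 - 3z$ (so $q = 3$), and the helper names `make_hierarchical_target` and `sample_scalings`. The scalings drop constants and logarithmic factors, so they indicate orders of magnitude only.

```python
import numpy as np


def make_hierarchical_target(d, rng):
    """Build one illustrative h = g(p(x)) with k = 2 and q = 3 (assumed choices)."""
    # Inner polynomial p(x) = x^T A x with A symmetric and trace-free,
    # so that E[p(x)] = tr(A) = 0 when x ~ N(0, I_d).
    A = rng.standard_normal((d, d))
    A = (A + A.T) / 2.0
    A -= (np.trace(A) / d) * np.eye(d)
    A /= np.linalg.norm(A)  # normalize to Frobenius norm 1

    def p(X):
        # X has shape (n, d); returns the n values x_i^T A x_i.
        return np.einsum("ni,ij,nj->n", X, A, X)

    def g(z):
        # Outer univariate polynomial of degree q = 3 (hypothetical choice).
        return z**3 - 3.0 * z

    def h(X):
        return g(p(X))

    return p, g, h


def sample_scalings(d, k, q):
    """Leading-order sample sizes, ignoring constants and log factors."""
    return {"three_layer": d**k, "kernel": d ** (k * q)}


rng = np.random.default_rng(0)
p, g, h = make_hierarchical_target(d=10, rng=rng)
X = rng.standard_normal((5, 10))  # inputs drawn from the standard Gaussian
y = h(X)                          # labels from the hierarchical target

scalings = sample_scalings(d=100, k=2, q=3)
```

With $d = 100$, $k = 2$, and $q = 3$, the two entries of `scalings` are $100^2 = 10^4$ and $100^6 = 10^{12}$, which shows how quickly the gap between the $d^k$ and $d^{kq}$ rates grows with the outer degree $q$.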

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- The paper considers the problem of understanding the benefits of feature learning for multi-layer neural networks. The one layer case has attracted a lot of attention, while multi-layer is so far way less understood.
- They are able to show a large separation with kernel methods. The reason is that the first layer is able to extract a good representation of the data (the polynomial $p(x)$) from only seeing samples $g \circ p (x)$.
- This class of target functions naturally generalizes the sing
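The reviewer's point about representation can be illustrated with a small Python sketch. It is not the paper's algorithm: it assumes an oracle that hands over the inner feature $z = p(x)$, whereas the network in the paper has to learn that feature from samples of $g \circ p(x)$. The specific `p`, `g`, dimensions, and sample sizes are hypothetical choices. Under that assumption, recovering $g$ reduces to a one-dimensional degree-$q$ least-squares fit that needs few samples regardless of $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, q = 20, 200, 3  # assumed dimension, sample count, and outer degree


def p(X):
    # Hypothetical degree-2 inner polynomial: normalized sum of x_i * x_{i+1}.
    return (X[:, :-1] * X[:, 1:]).sum(axis=1) / np.sqrt(X.shape[1] - 1)


def g(z):
    # Hypothetical degree-3 outer polynomial.
    return z**3 - 3.0 * z


X = rng.standard_normal((n, d))
y = g(p(X))  # noiseless labels from h = g o p

# With the feature z = p(x) given (oracle assumption), fit g by 1D least squares.
coeffs = np.polyfit(p(X), y, deg=q)  # coefficients, highest degree first

# Evaluate the fitted composition on fresh Gaussian inputs.
X_test = rng.standard_normal((1000, d))
pred = np.polyval(coeffs, p(X_test))
test_mse = np.mean((pred - g(p(X_test))) ** 2)
```

Because the labels are noiseless and $g$ has degree 3, the fit recovers the coefficients of $z^3 - 3z$ up to numerical error. The hard part, which this sketch skips, is obtaining $p(x)$ without being told it.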

Weaknesses

- The architecture and algorithm are chosen specifically to succeed for this specific class of target functions (composition of degree-$k$ multivariate polynomials with univariate functions). Several previous works have considered such layer-wise training on non-regular architectures for specific hierarchical classes of functions, including [Allen-Zhu, Li, 2020] and [The staircase property, Abbe, Boix, Brennan, Bresler, Nagaraj, 2020]. It is unclear how this paper contributes in terms of novel ide

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 2

Strengths

The paper is pleasant to read and the mathematical results are supported with extensive discussion.

Weaknesses

The main weakness of the paper is the close relationship with previous works. Although a fair comparison is given, the works by [Allen-Zhu & Li, 2019; 2020] and [Nichani et al., 2023] contain many of the key ideas in the manuscript.

Reviewer 03 · Rating 8 · accept, good paper · Confidence 3

Strengths

The paper's main result is that for a subclass of degree $k$ polynomials $p$ and standard Gaussian marginals, a 3-layer NN trained via layerwise GD on the $L_2$ loss learns the target hierarchical polynomial (realizable setting) with roughly $d^k$ samples and runtime that is polynomial in the parameters of the problem. I think that the result is interesting and fits well with the ICLR community. In general, the paper is easy to read: the assumptions are presented in a clear manner, comparison w

Weaknesses

I believe that it would be beneficial if the authors mentioned families of polynomials not captured by Assumption 4. This would make it clearer how strong and restrictive this assumption is. The assumption greatly simplifies the analysis, so it would be nice if the authors could discuss it further (I see why the families of Remark 3 satisfy this condition, but I think a further discussion of how this assumption simplifies the analysis would be helpful).

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning in Healthcare · Machine Learning and Data Classification