The staircase property: How hierarchical structure can guide deep learning
Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler, Dheeraj Nagaraj

TL;DR
This paper introduces the staircase property, a structural feature of data enabling deep neural networks to learn hierarchically, and demonstrates its significance through theoretical proofs and experiments with standard architectures.
Contribution
The paper defines the staircase property for functions over the Boolean hypercube and proves that neural networks can learn such functions efficiently using layerwise stochastic coordinate descent.
Findings
Staircase functions are learnable in polynomial time by neural networks.
Gradient-based algorithms learn high-level features by combining lower-level features.
Experiments show staircase functions are learnable by standard ResNet architectures.
Abstract
This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more…
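To make the chain condition in the abstract concrete, here is a minimal Python sketch. It assumes the canonical example S_d(x) = x_1 + x_1 x_2 + ... + x_1 x_2 ... x_d on {-1, +1}^d and a simplified reading of the property: every nonzero Fourier coefficient on a set of size at least 2 must have a nonzero coefficient on some subset with one fewer element. The function names (`staircase`, `fourier_coefficients`, `has_increasing_chains`) and the tolerance are hypothetical choices for illustration, not the paper's formal definition or code.

```python
from itertools import combinations, product
import math


def staircase(x):
    """Assumed example: S_d(x) = x1 + x1*x2 + ... + x1*...*xd, x in {-1,+1}^d."""
    total, prefix = 0, 1
    for xi in x:
        prefix *= xi      # running monomial x1 * ... * xi
        total += prefix
    return total


def fourier_coefficients(f, d):
    """Brute-force f_hat(S) = E_x[f(x) * prod_{i in S} x_i] over the uniform hypercube."""
    points = list(product((-1, 1), repeat=d))
    coeffs = {}
    for k in range(d + 1):
        for subset in combinations(range(d), k):
            acc = sum(f(x) * math.prod(x[i] for i in subset) for x in points)
            coeffs[frozenset(subset)] = acc / len(points)
    return coeffs


def has_increasing_chains(coeffs, tol=1e-9):
    """Simplified check (an assumption): each nonzero coefficient on a set S
    with |S| >= 2 has a nonzero coefficient on some S minus one element."""
    for S, c in coeffs.items():
        if abs(c) <= tol or len(S) < 2:
            continue
        if not any(abs(coeffs[S - {i}]) > tol for i in S):
            return False
    return True


def parity(x):
    """Pure parity x1*...*xd: a single top-degree coefficient, no chain below it."""
    return math.prod(x)
```

Under this simplified check, `has_increasing_chains(fourier_coefficients(staircase, 3))` holds because the nonzero coefficients sit on the nested sets {1}, {1,2}, {1,2,3}, whereas the pure parity on three bits fails it, since its only nonzero coefficient has no lower-order coefficient to build from.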
Taxonomy
Topics: Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: Residual Connection · Batch Normalization · Average Pooling · 1x1 Convolution · Convolution · Residual Block · Bottleneck Residual Block · Global Average Pooling · Kaiming Initialization
