TL;DR
This paper studies how the learning rate(s) of gradient-based algorithms determine the sample complexity of feature learning in neural networks trained on Gaussian single-index models, and shows that sample complexity undergoes a phase transition as the learning rate changes.
Contribution
It characterizes how the learning rate drives phase transitions in sample complexity, unifies prior analyses of correlational and non-correlational updates, and introduces a new layer-wise training algorithm that uses two timescales (two different learning rates).
Findings
Identifies a phase transition from the information-exponent regime to the generative-exponent regime, governed by the learning rate.
Demonstrates that the choice of learning rate matters for both statistical and computational efficiency.
Introduces a novel layer-wise training method with two different learning rates.
Abstract
To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the link function, recent works improve on this by performing multiple gradient steps on the same sample with different learning rates, which yields a non-correlational update rule whose sample complexity is instead limited by the (potentially much smaller) generative exponent. However, this picture is only valid when these learning rates are sufficiently large. In this paper, we characterize the relationship between learning rate(s) and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates. We…
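The two exponents named in the abstract can be made concrete with a small numerical sketch. It assumes the standard definitions from this literature, which the abstract does not restate: the information exponent is the degree of the first nonzero term (of degree at least one) in the Hermite expansion of the link function, and the generative exponent is the smallest information exponent obtainable after applying a transformation to the label. The helper names (`sample_single_index`, `hermite_coefficient`, `information_exponent`) and the example link `he3` are illustrative choices, not taken from the paper.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def sample_single_index(link, theta, n, rng):
    """Draw n pairs (x, y) with x ~ N(0, I_d) and y = link(<x, theta>)."""
    x = rng.standard_normal((n, theta.shape[0]))
    return x, link(x @ theta)


def hermite_coefficient(link, k, n_nodes=40):
    """Return E[link(z) He_k(z)] / k! for z ~ N(0, 1), via Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)  # quadrature for the weight exp(-z^2 / 2)
    basis = np.zeros(k + 1)
    basis[k] = 1.0  # coefficient vector selecting the probabilists' Hermite polynomial He_k
    integral = np.sum(weights * link(nodes) * hermeval(nodes, basis))
    expectation = integral / math.sqrt(2.0 * math.pi)  # normalise to the Gaussian density
    return float(expectation / math.factorial(k))


def information_exponent(link, max_degree=10, tol=1e-8):
    """Smallest k >= 1 whose Hermite coefficient is nonzero (None if none is found)."""
    for k in range(1, max_degree + 1):
        if abs(hermite_coefficient(link, k)) > tol:
            return k
    return None


def he3(z):
    """Illustrative link function: the third Hermite polynomial He_3(z) = z^3 - 3z."""
    return z**3 - 3 * z


# Information exponent of the raw label y = He_3(z).
ie_raw = information_exponent(he3)

# Information exponent after the label transformation y -> y^2 (one possible choice).
ie_squared = information_exponent(lambda z: he3(z) ** 2)

# A few samples from the corresponding single-index model (hypothetical dimension d = 8).
rng = np.random.default_rng(0)
theta = np.zeros(8)
theta[0] = 1.0
x, y = sample_single_index(he3, theta, 5, rng)
```

Here `ie_raw` is 3 and `ie_squared` is 2, so squaring the label already lowers the exponent. Because squaring is only one of many possible transformations, this shows only that the generative exponent of this link is at most 2, not its exact value.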