Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
Behrad Moniri, Hamed Hassani

TL;DR
This paper analyzes the features a two-layer neural network learns in its second gradient descent step, showing how multiple learned directions emerge depending on the step sizes and on whether the batch is reused, going beyond existing one-step analyses.
Contribution
It provides a spectral characterization of features learned after two gradient steps in high-dimensional linear-width networks, highlighting the impact of batch reuse on learning directions.
Findings
Each learned direction corresponds to an outlier in the spectrum of the updated weight matrix.
Batch reuse enables learning directions with a higher information exponent.
The number of learned directions depends on the scaling parameters of the gradient step sizes.
Abstract
We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent, with step-sizes in a prescribed scaling regime. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We…
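The abstract's remark that a single gradient step is approximately rank-one can be checked numerically on a toy problem. The sketch below is only an illustration under assumed choices that do not come from the paper: a tanh activation, a tanh single-index target (which has information exponent one), Gaussian data and initialization, squared loss, and the specific sizes d, n, N. The function name `first_gradient_spectrum` is hypothetical. It computes the first-step gradient with respect to the hidden-layer weights and returns its singular values, along with the alignment between the top left singular vector and the second-layer weights.

```python
import numpy as np

def first_gradient_spectrum(d=400, n=800, N=600, seed=0):
    """Singular values of the first-step gradient of a toy two-layer network.

    All modelling choices here (tanh activation, tanh target, Gaussian data,
    squared loss, sizes) are illustrative assumptions, not the paper's setup.
    """
    rng = np.random.default_rng(seed)

    # Linear-width regime: d, n, N are of comparable size.
    X = rng.standard_normal((n, d))                 # inputs, one row per sample
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)                    # hidden target direction
    y = np.tanh(X @ beta) + 0.1 * rng.standard_normal(n)

    # Network f(x) = a^T tanh(W x) / sqrt(N) at random initialization.
    W = rng.standard_normal((N, d)) / np.sqrt(d)
    a = rng.standard_normal(N)
    Z = X @ W.T                                     # pre-activations, shape (n, N)
    residual = y - np.tanh(Z) @ a / np.sqrt(N)

    # Gradient of (1/2n) * sum_j (y_j - f(x_j))^2 with respect to W.
    D = (1.0 - np.tanh(Z) ** 2) * residual[:, None] * a[None, :]
    G = -(D.T @ X) / (n * np.sqrt(N))               # shape (N, d)

    U, s, _ = np.linalg.svd(G, full_matrices=False)
    # Cosine between the top left singular vector and the second-layer weights.
    align = abs(U[:, 0] @ a) / np.linalg.norm(a)
    return s, align

if __name__ == "__main__":
    s, align = first_gradient_spectrum()
    print("top two singular values:", s[0], s[1])
    print("alignment of top left singular vector with a:", align)
```

In this toy setting the leading singular value should stand well clear of the rest, which is the single-spike picture the one-step analyses describe. The multiple outliers after a second step, which are the subject of the paper, are not reproduced by this sketch.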
