Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

Behrad Moniri; Hamed Hassani

arXiv:2605.17767·stat.ML·May 19, 2026

TL;DR

This paper analyzes the features a two-layer neural network learns during its second gradient descent step. Going beyond the one-step analysis, it shows how multiple learned directions emerge, and how they depend on the learning rates and on batch reuse.

Contribution

It provides a spectral characterization of features learned after two gradient steps in high-dimensional linear-width networks, highlighting the impact of batch reuse on learning directions.

Findings

01. Each learned direction corresponds to an outlier in the spectrum of the updated weights.

02. Batch reuse enables learning of directions with a higher information exponent.

03. The number of learned directions depends on the scaling parameters of the gradient steps.

Abstract

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0, 0.5)$. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We…
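The one-step limitation described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the paper's experimental setup: the tanh activation, the single-index tanh teacher, the squared loss, the initialization scales, and the sizes `n`, `d`, `N` are all illustrative assumptions. It forms the first gradient of the first-layer weights and inspects its singular values; a large gap between the first and second singular values is what "approximately rank-one" looks like in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 2000, 500, 500  # sample size, input dimension, width (proportional; assumed values)

# Assumed single-index teacher with information exponent one: y = tanh(<beta, x>).
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, d))
y = np.tanh(X @ beta)

# Assumed student f(x) = a^T tanh(W x) / sqrt(N) at random initialization.
W = rng.standard_normal((N, d)) / np.sqrt(d)
a = rng.standard_normal(N) / np.sqrt(N)

pre = X @ W.T                                   # (n, N) pre-activations
resid = y - np.tanh(pre) @ a / np.sqrt(N)       # (n,) residuals at initialization

# Negative gradient of the loss (1/2n) * sum(resid**2) with respect to W.
act_grad = 1.0 - np.tanh(pre) ** 2              # tanh'(pre), shape (n, N)
G = (act_grad * resid[:, None] * a[None, :]).T @ X / (n * np.sqrt(N))  # (N, d)

# Singular values of the update direction and alignment of its top right
# singular vector with the teacher direction beta.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
spectral_gap = s[0] / s[1]
alignment = abs(Vt[0] @ beta)

print(f"top singular value / second singular value: {spectral_gap:.1f}")
print(f"|<top right singular vector, beta>|: {alignment:.2f}")
```

In this sketch the single dominant singular direction plays the role of the one learned direction; the multiple outliers described in the abstract concern the weights after the second step, which this example does not simulate.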
