How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks
Mo Zhou, Rong Ge

TL;DR
This paper provides a local convergence analysis of gradient descent in regularized two-layer neural networks, showing how features are learned both early and late in training, beyond the neural tangent kernel regime.
Contribution
It introduces a local convergence framework demonstrating feature learning at different training stages in regularized neural networks, extending beyond NTK limitations.
Findings
Once the loss falls below a certain threshold, gradient descent on a carefully regularized objective captures the ground-truth directions
Feature learning occurs both early and late in training
Regularization facilitates feature learning in neural networks
Abstract
The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. We further strengthen this local convergence analysis by incorporating early-stage feature learning analysis. Our results demonstrate that feature learning not…
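The setting described in the abstract can be illustrated with a small teacher-student experiment. The sketch below is an assumption-laden toy, not the paper's exact objective or analysis: it assumes a ReLU activation, Gaussian inputs, a plain weight-decay penalty on both layers, and a teacher with orthonormal directions. All function names (`make_teacher_data`, `train`, `alignment`) and hyperparameters are hypothetical choices made for illustration. It trains a regularized two-layer network with full-batch gradient descent and reports how well the learned first-layer weights align with the teacher's ground-truth directions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W, a, X):
    """Two-layer ReLU network: f(x) = sum_j a_j * relu(<w_j, x>)."""
    return relu(X @ W.T) @ a

def regularized_loss(W, a, X, y, lam):
    """Half mean squared error plus weight decay on both layers (assumed regularizer)."""
    resid = forward(W, a, X) - y
    return 0.5 * np.mean(resid ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))

def gradients(W, a, X, y, lam):
    """Gradients of regularized_loss with respect to W (m x d) and a (m,)."""
    n = X.shape[0]
    pre = X @ W.T                      # pre-activations, shape (n, m)
    act = relu(pre)
    resid = act @ a - y                # residuals, shape (n,)
    grad_a = act.T @ resid / n + lam * a
    mask = (pre > 0).astype(X.dtype)   # ReLU derivative (taken as 0 at the kink)
    grad_W = (resid[:, None] * mask * a[None, :]).T @ X / n + lam * W
    return grad_W, grad_a

def make_teacher_data(n=200, d=5, k=2, seed=1):
    """Hypothetical teacher: sum of k ReLU neurons with orthonormal directions."""
    rng = np.random.default_rng(seed)
    W_star = np.linalg.qr(rng.standard_normal((d, k)))[0].T  # k orthonormal rows
    X = rng.standard_normal((n, d))
    y = relu(X @ W_star.T).sum(axis=1)
    return X, y, W_star

def train(X, y, m=8, lam=1e-3, lr=0.05, steps=1000, init_scale=0.1, seed=0):
    """Full-batch gradient descent from a small random initialization."""
    rng = np.random.default_rng(seed)
    W = init_scale * rng.standard_normal((m, X.shape[1]))
    a = init_scale * rng.standard_normal(m)
    losses = [regularized_loss(W, a, X, y, lam)]
    for _ in range(steps):
        gW, ga = gradients(W, a, X, y, lam)
        W -= lr * gW
        a -= lr * ga
        losses.append(regularized_loss(W, a, X, y, lam))
    return W, a, losses

def alignment(W, W_star, eps=1e-12):
    """For each teacher direction, the best cosine similarity over student neurons."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    Sn = W_star / (np.linalg.norm(W_star, axis=1, keepdims=True) + eps)
    return (Sn @ Wn.T).max(axis=1)

if __name__ == "__main__":
    X, y, W_star = make_teacher_data()
    W, a, losses = train(X, y)
    print("initial loss:", losses[0], "final loss:", losses[-1])
    print("best cosine alignment per teacher direction:", alignment(W, W_star))
```

Tracking `alignment` alongside `losses` during training is one way to observe, on a toy problem, whether neurons move toward ground-truth directions as the loss drops. In an NTK-style regime the first-layer weights would stay close to initialization, so this alignment would be expected to change little.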
Taxonomy
Topics: Neural Networks and Applications
Methods: Neural Tangent Kernel
