Feature selection with gradient descent on two-layer networks in low-rotation regimes
Matus Telgarsky

TL;DR
This paper proves low test error for gradient flow and SGD on two-layer ReLU networks in regimes where weights rotate little, using margin-based analysis to improve understanding of training dynamics and generalization.
Contribution
It introduces new analyses of gradient descent in low-rotation regimes, showing how margins are achieved and maintained, with implications for network width and sample complexity.
Findings
Near initialization, GF and SGD achieve margins proportional to the NTK margin.
The entire GF trajectory is analyzed in the Neural Collapse setting.
Networks whose weights are constrained not to rotate attain globally maximal margins.
Abstract
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt{m})$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only…
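The quantities the abstract refers to (how far the weights move from initialization, how much each inner weight vector rotates, and the margin) can be measured on a toy problem. The following is a minimal sketch, not the paper's setup: the initialization scaling, the synthetic data, the logistic loss, the step size, and all function names (`init_network`, `sgd_step`, `rotation_cosines`, `min_margin`) are assumptions made for illustration.

```python
import numpy as np


def init_network(m, d, rng):
    # Assumed scaling (not taken from the paper): inner weights with
    # variance 1/d, outer weights with variance 1/m.
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    return W, a


def forward(W, a, X):
    # Two-layer ReLU network: f(x) = sum_j a_j * relu(<w_j, x>).
    return np.maximum(X @ W.T, 0.0) @ a


def sgd_step(W, a, x, y, lr):
    # One SGD step on the logistic loss log(1 + exp(-y f(x))) for one example.
    pre = W @ x
    act = np.maximum(pre, 0.0)
    f = act @ a
    g = -y / (1.0 + np.exp(y * f))  # derivative of the loss with respect to f
    grad_a = g * act
    grad_W = g * (a * (pre > 0))[:, None] * x[None, :]
    return W - lr * grad_W, a - lr * grad_a


def rotation_cosines(W, W0):
    # Cosine between each inner weight vector and its initialization;
    # a value of 1.0 means that vector has not rotated.
    num = np.sum(W * W0, axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(W0, axis=1)
    return num / den


def min_margin(W, a, X, y):
    # Unnormalized minimum margin: min_i y_i f(x_i).
    return np.min(y * forward(W, a, X))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, d, n = 256, 5, 64
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.sign(X[:, 0])  # synthetic labels: sign of the first coordinate

    W0, a0 = init_network(m, d, rng)
    W, a = W0.copy(), a0.copy()
    for _ in range(2000):
        i = rng.integers(n)
        W, a = sgd_step(W, a, X[i], y[i], lr=0.1)

    print("width m:", m)
    print("inner weight movement ||W - W0||_F:", np.linalg.norm(W - W0))
    print("smallest cosine to initialization:", rotation_cosines(W, W0).min())
    print("minimum margin on the sample:", min_margin(W, a, X, y))
```

A smallest cosine close to 1 indicates that no inner weight vector rotated much during this run. The sketch only reports these numbers for one random seed and makes no claim about matching the paper's bounds.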
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
Methods: Test · Stochastic Gradient Descent · Neural Tangent Kernel
