Feature selection with gradient descent on two-layer networks in low-rotation regimes

Matus Telgarsky

arXiv:2208.02789 · cs.LG · August 5, 2022 · 1 citation

TL;DR

This paper proves low test error for gradient flow and SGD on two-layer ReLU networks in regimes where weights rotate little, using margin-based analysis to improve understanding of training dynamics and generalization.

Contribution

It introduces new analyses of gradient descent in low-rotation regimes, showing how margins are achieved and maintained, with implications for network width and sample complexity.

Findings

01

Near initialization, GF attains at least the NTK margin, and GF and SGD need a network width and number of samples only inversely proportional to the NTK margin.

02

An analysis of the entire GF trajectory in the Neural Collapse setting.

03

Constrained weights with no rotation attain globally maximal margins.

Abstract

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $O(\sqrt{m})$, where $m$ denotes the network width, which is in sharp contrast to the $O(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only…
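To make the quantities in the abstract concrete, here is a minimal NumPy sketch of a two-layer ReLU network, its normalized margin, and a per-neuron measure of weight rotation. It is an illustration only, not the paper's construction: the function names, the Gaussian initialization scaling, and the choice to normalize the margin by the squared parameter norm (a common convention for 2-homogeneous networks) are all assumptions.

```python
import numpy as np

def init_two_layer(m, d, rng):
    """Hypothetical Gaussian initialization; the exact scaling is an assumption."""
    W = rng.standard_normal((m, d)) / np.sqrt(d)  # inner weights, one row per neuron
    a = rng.standard_normal(m) / np.sqrt(m)       # outer weights
    return W, a

def predict(W, a, X):
    """Two-layer ReLU network: f(x) = sum_j a_j * relu(<w_j, x>)."""
    return np.maximum(X @ W.T, 0.0) @ a

def normalized_margin(W, a, X, y):
    """Smallest y_i * f(x_i), divided by the squared parameter norm.

    The network is 2-homogeneous in (W, a), so this ratio is unchanged
    when all parameters are rescaled by a positive constant.
    """
    sq_norm = np.sum(W ** 2) + np.sum(a ** 2)
    return np.min(y * predict(W, a, X)) / sq_norm

def rotation_cosines(W0, W):
    """Cosine of the angle between each neuron's initial and current weights."""
    num = np.sum(W0 * W, axis=1)
    den = np.linalg.norm(W0, axis=1) * np.linalg.norm(W, axis=1)
    return num / den
```

Under these assumptions, cosines close to 1 for every neuron would correspond to the "low rotation" described above, and a positive normalized margin means every training example is classified correctly.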

Videos

Feature Selection with Gradient Descent on Two-layer Networks in Low-rotation Regimes · YouTube

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques

Methods: Test · Stochastic Gradient Descent · Neural Tangent Kernel