The Quenching-Activation Behavior of the Gradient Descent Dynamics for Two-layer Neural Network Models
Chao Ma, Lei Wu, Weinan E

TL;DR
This paper studies the gradient descent dynamics of two-layer neural networks across different parameter regimes. It identifies a quenching-activation process that may explain implicit regularization and that differs from mean-field behavior.
Contribution
It presents a numerical and phenomenological study of GD dynamics, highlighting the quenching-activation process and its implications for how two-layer neural networks behave during training.
Findings
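Identifies two phases in the GD dynamics for Xavier-like initialization in the under-parametrized regime: an early phase in which the neurons are effectively quenched and GD closely follows the corresponding random feature model, and a late phase in which a few activated neurons dominate while the remaining background neurons stay quenched.
Shows that this neural network-like behavior persists into the mildly over-parametrized regime before giving way to random feature-like behavior.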
Suggests the quenching-activation process as a mechanism for implicit regularization.
Abstract
A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes when the target function can be accurately approximated by a relatively small number of neurons. It is found that for Xavier-like initialization, there are two distinctive phases in the dynamic behavior of GD in the under-parametrized regime: An early phase in which the GD dynamics follows closely that of the corresponding random feature model and the neurons are effectively quenched, followed by a late phase in which the neurons are divided into two groups: a group of a few "activated" neurons that dominate the dynamics and a group of background (or "quenched") neurons that support the continued activation and deactivation process. This neural network-like behavior is continued into the mildly over-parametrized regime,…
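The setup described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the ReLU activation, the squared loss, the single-neuron target, the specific Xavier-like scaling (variance 1/m for the outer coefficients, 1/d for the inner weights), and all function names (`make_data`, `train`, `gd_step`, `neuron_magnitudes`) are assumptions made for illustration. Setting `train_inner=False` freezes the inner weights, which gives the corresponding random feature model, so the two loss curves can be compared from the same initialization.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def make_data(n, d, seed=1):
    """Gaussian inputs with a single-neuron target relu(x_1) (an assumed toy target)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [relu(x[0]) for x in X]
    return X, y


def init_params(m, d, rng):
    """Xavier-like init (assumed scaling): a_j ~ N(0, 1/m), w_j ~ N(0, I/d)."""
    a = [rng.gauss(0.0, math.sqrt(1.0 / m)) for _ in range(m)]
    W = [[rng.gauss(0.0, math.sqrt(1.0 / d)) for _ in range(d)] for _ in range(m)]
    return a, W


def predict(a, W, x):
    """Two-layer network f(x) = sum_j a_j * relu(w_j . x)."""
    return sum(aj * relu(sum(wk * xk for wk, xk in zip(wj, x))) for aj, wj in zip(a, W))


def loss(a, W, X, y):
    """Half mean squared error over the training set."""
    return sum((predict(a, W, x) - t) ** 2 for x, t in zip(X, y)) / (2 * len(X))


def gd_step(a, W, X, y, lr, train_inner):
    """One full-batch GD step; the inner weights W move only if train_inner is True."""
    n, m, d = len(X), len(a), len(X[0])
    grad_a = [0.0] * m
    grad_W = [[0.0] * d for _ in range(m)]
    for x, t in zip(X, y):
        pre = [sum(wk * xk for wk, xk in zip(wj, x)) for wj in W]
        err = sum(aj * relu(p) for aj, p in zip(a, pre)) - t
        for j in range(m):
            grad_a[j] += err * relu(pre[j]) / n
            if train_inner and pre[j] > 0.0:  # ReLU derivative is 1 on active inputs
                for k in range(d):
                    grad_W[j][k] += err * a[j] * x[k] / n
    new_a = [aj - lr * g for aj, g in zip(a, grad_a)]
    new_W = [[wk - lr * g for wk, g in zip(wj, gj)] for wj, gj in zip(W, grad_W)]
    return new_a, new_W


def train(X, y, m, lr=0.05, steps=100, train_inner=True, seed=0):
    """Run GD and return final parameters, the initial inner weights, and the loss history."""
    rng = random.Random(seed)
    a, W = init_params(m, len(X[0]), rng)
    W0 = [row[:] for row in W]
    losses = [loss(a, W, X, y)]
    for _ in range(steps):
        a, W = gd_step(a, W, X, y, lr, train_inner)
        losses.append(loss(a, W, X, y))
    return a, W, W0, losses


def neuron_magnitudes(a, W):
    """|a_j| * ||w_j|| per neuron: one possible size measure (an assumption here)."""
    return [abs(aj) * math.sqrt(sum(wk * wk for wk in wj)) for aj, wj in zip(a, W)]


if __name__ == "__main__":
    X, y = make_data(n=20, d=3)
    a_nn, W_nn, _, loss_nn = train(X, y, m=10, train_inner=True)
    a_rf, W_rf, _, loss_rf = train(X, y, m=10, train_inner=False)
    print("two-layer network loss:", loss_nn[0], "->", loss_nn[-1])
    print("random feature loss:   ", loss_rf[0], "->", loss_rf[-1])
    print("neuron magnitudes:", sorted(neuron_magnitudes(a_nn, W_nn), reverse=True))
```

Because both runs share the same seed, they start from identical parameters, and any gap between the two loss curves comes only from training the inner weights. Sorting the output of `neuron_magnitudes` is one simple way to check whether a few neurons have grown much larger than the rest; the problem sizes here are far too small to reproduce the regimes studied in the paper.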
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and ELM · Stochastic Gradient Optimization Techniques
