On regularization of gradient descent, layer imbalance and flat minima
Boris Ginsburg

TL;DR
This paper introduces a new metric called layer imbalance to analyze training dynamics in deep linear networks, revealing how regularization methods influence flat minima and demonstrating SGD's similarity to noise regularization.
Contribution
It proposes the layer imbalance metric, analyzes the two-phase training process, and extends the analysis to stochastic gradient descent, linking regularization and flat minima.
Findings
Regularization methods such as weight decay and noise data augmentation behave similarly.
Training occurs in two phases: optimization and regularization.
SGD acts similarly to noise regularization in the training process.
Abstract
We analyze the training dynamics for deep linear networks using a new metric - layer imbalance - which defines the flatness of a solution. We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way. Training has two distinct phases: 1) optimization and 2) regularization. First, during the optimization phase, the loss function monotonically decreases, and the trajectory goes toward a minima manifold. Then, during the regularization phase, the layer imbalance decreases, and the trajectory goes along the minima manifold toward a flat area. Finally, we extend the analysis for stochastic gradient descent and show that SGD works similarly to noise regularization.
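The two phases described above can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's setup: it assumes a scalar two-layer linear network y = w1 * w2 * x fitted to the target y = x, and it assumes layer imbalance is measured as |w1^2 - w2^2|, since this summary does not give the exact definition. The function name `train` and all hyperparameter values are hypothetical choices made for illustration.

```python
def train(w1, w2, lr=0.05, weight_decay=0.01, steps=20000):
    """Gradient descent on 0.5*(w1*w2 - 1)^2 + 0.5*weight_decay*(w1^2 + w2^2).

    Returns the final weights and a per-step history of
    (data loss, layer imbalance). Imbalance is taken to be
    |w1^2 - w2^2|, which is an assumption of this sketch.
    """
    history = []
    for _ in range(steps):
        err = w1 * w2 - 1.0                 # residual of the scalar linear network
        loss = 0.5 * err ** 2               # data loss, without the penalty term
        imbalance = abs(w1 ** 2 - w2 ** 2)  # assumed layer-imbalance measure
        history.append((loss, imbalance))
        # Gradients of the penalized objective with respect to each layer.
        g1 = err * w2 + weight_decay * w1
        g2 = err * w1 + weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2, history


# Start from a deliberately unbalanced point.
w1, w2, history = train(2.0, 0.1)
for step in (0, 200, 5000, 19999):
    loss, imbalance = history[step]
    print(f"step {step:5d}  loss {loss:.2e}  imbalance {imbalance:.2e}")
```

In this toy problem the data loss drops within the first few hundred steps while the imbalance barely moves, and only afterwards does the imbalance decay slowly as the weights drift along the set of near-minimal solutions toward w1 ≈ w2. Rerunning with `weight_decay=0.0` leaves the imbalance almost unchanged, which is consistent with the claim that regularization is what drives the second phase.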
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Weight Decay · Stochastic Gradient Descent
