On regularization of gradient descent, layer imbalance and flat minima

Boris Ginsburg

arXiv:2007.09286 · cs.LG · July 21, 2020 · 1 citation

TL;DR

This paper introduces a new metric, layer imbalance, to analyze training dynamics in deep linear networks. It shows that regularization methods such as weight decay and noise data augmentation steer training toward flat minima, and that SGD behaves similarly to noise regularization.

Contribution

The paper proposes the layer imbalance metric, describes training as a two-phase process (optimization, then regularization), and extends the analysis to stochastic gradient descent, linking regularization to flat minima.

Findings

01

Regularization methods such as weight decay and noise data augmentation behave similarly.

02

Training occurs in two phases: optimization and regularization.

03

SGD acts similarly to noise regularization in the training process.
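
The effect of noise on layer imbalance can be illustrated with a toy sketch. It is not taken from the paper: it assumes a two-layer scalar linear network y = w1 * w2 * x trained to map x = 1 to y = 1, and it assumes the layer imbalance is measured as |w1^2 - w2^2|, a definition this page does not state. The function name, the uniform input noise, and all hyperparameters are hypothetical. It covers only noise data augmentation, not mini-batch SGD.

```python
import random


def train_with_input_noise(w1=2.0, w2=0.1, lr=0.1, noise=0.5, steps=20000, seed=0):
    """Plain gradient descent on 0.5 * (w1 * w2 * x - 1)^2 with a noisy input x.

    Returns the final weights and the assumed imbalance |w1^2 - w2^2|
    recorded before training and after every step.
    """
    rng = random.Random(seed)
    imbalance = [abs(w1 * w1 - w2 * w2)]
    for _ in range(steps):
        # Noise augmentation: perturb the clean input x = 1, keep the target at 1.
        x = 1.0 + rng.uniform(-noise, noise)
        err = w1 * w2 * x - 1.0
        g1 = err * x * w2  # d(loss)/d(w1)
        g2 = err * x * w1  # d(loss)/d(w2)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
        imbalance.append(abs(w1 * w1 - w2 * w2))
    return w1, w2, imbalance


_, _, noisy = train_with_input_noise(noise=0.5)
_, _, clean = train_with_input_noise(noise=0.0)
print(f"imbalance with input noise:    {noisy[0]:.3f} -> {noisy[-1]:.2e}")
print(f"imbalance without input noise: {clean[0]:.3f} -> {clean[-1]:.3f}")
```

In this toy setting each step multiplies w1^2 - w2^2 by (1 - lr^2 * err^2 * x^2). Without noise the error vanishes once the loss is minimized, so the imbalance stops changing. With noise the error never settles at zero, so the imbalance keeps shrinking.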

Abstract

We analyze the training dynamics for deep linear networks using a new metric - layer imbalance - which defines the flatness of a solution. We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way. Training has two distinct phases: 1) optimization and 2) regularization. First, during the optimization phase, the loss function monotonically decreases, and the trajectory goes toward a minima manifold. Then, during the regularization phase, the layer imbalance decreases, and the trajectory goes along the minima manifold toward a flat area. Finally, we extend the analysis for stochastic gradient descent and show that SGD works similarly to noise regularization.
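
The two phases described in the abstract can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the paper's experiment: it uses a two-layer scalar linear network with loss 0.5 * (w1 * w2 - 1)^2, adds weight decay as an L2 penalty, and assumes the layer imbalance is measured as |w1^2 - w2^2|. The function name, starting point, and hyperparameters are hypothetical.

```python
def train(w1=2.0, w2=0.1, lr=0.05, weight_decay=0.02, steps=5000):
    """Gradient descent on 0.5*(w1*w2 - 1)^2 + 0.5*weight_decay*(w1^2 + w2^2).

    Returns a list where entry k holds (fit loss, imbalance) after k steps.
    The imbalance |w1^2 - w2^2| is an assumed definition for this sketch.
    """
    history = [(0.5 * (w1 * w2 - 1.0) ** 2, abs(w1 * w1 - w2 * w2))]
    for _ in range(steps):
        err = w1 * w2 - 1.0
        g1 = err * w2 + weight_decay * w1  # d(loss)/d(w1)
        g2 = err * w1 + weight_decay * w2  # d(loss)/d(w2)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
        history.append((0.5 * (w1 * w2 - 1.0) ** 2, abs(w1 * w1 - w2 * w2)))
    return history


history = train()
for step in (0, 200, 1000, 5000):
    fit_loss, imbalance = history[step]
    print(f"step {step:5d}  fit loss {fit_loss:.2e}  imbalance {imbalance:.2e}")
```

In this sketch the fit loss drops within the first few hundred steps while the imbalance is still large. The imbalance then decays slowly as the weights slide along the set of near-minimal solutions toward the balanced point w1 = w2.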

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Weight Decay · Stochastic Gradient Descent