Implicit bias of SGD in $L_{2}$-regularized linear DNNs: One-way jumps from high to low rank
Zihan Wang, Arthur Jacot

TL;DR
This paper investigates the implicit bias of stochastic gradient descent (SGD) in deep linear neural networks with $L_2$ regularization. It shows that SGD can jump from higher-rank to lower-rank minima with zero probability of jumping back, which biases training toward low-rank solutions.
Contribution
It introduces a probabilistic framework demonstrating SGD's tendency to move from higher to lower rank minima in $L_2$-regularized deep linear networks, revealing a one-way jump behavior.
Findings
SGD always has some probability of jumping from a higher-rank minimum to a lower-rank one.
The probability of jumping back from low to high rank minima is zero.
SGD's behavior is characterized by absorbing sets for different ranks.
Abstract
The $L_{2}$-regularized loss of Deep Linear Networks (DLNs) with more than one hidden layer has multiple local minima, corresponding to matrices with different ranks. In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data. While rank-underestimating minima can be avoided since they do not fit the data, GD might get stuck at rank-overestimating minima. We show that with SGD, there is always a probability to jump from a higher rank minimum to a lower rank one, but the probability of jumping back is zero. More precisely, we define a sequence of sets $B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ so that $B_{r}$ contains all minima of rank $r$ or less (and not more) that are absorbing for small enough ridge parameters $\lambda$ and learning rates $\eta$: SGD has prob. 0 of leaving $B_{r}$, and from any…
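The setting described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative toy, not the authors' code or experimental setup: the function names (`product`, `regularized_loss`, `sgd_step`, `numerical_rank`, `run_demo`), the near-identity initialization, the 70% observation rate, the rank tolerance, and all hyperparameters are assumptions chosen for readability. It trains a three-matrix linear network (two hidden layers) on a rank-1 matrix completion task, sampling one observed entry per SGD step and applying an $L_2$ (ridge) penalty on every weight matrix, then reports the regularized loss and a thresholded rank of the end-to-end matrix.

```python
import numpy as np


def product(ws):
    """End-to-end matrix W_L @ ... @ W_1 for ws = [W_1, ..., W_L]."""
    out = ws[0]
    for w in ws[1:]:
        out = w @ out
    return out


def regularized_loss(ws, target, mask, lam):
    """Mean squared error on observed entries plus lam * sum of squared weights."""
    resid = (product(ws) - target)[mask]
    return 0.5 * np.mean(resid ** 2) + lam * sum(np.sum(w ** 2) for w in ws)


def sgd_step(ws, target, i, j, lam, lr):
    """One SGD step using the single observed entry (i, j)."""
    d = ws[0].shape[0]
    resid = product(ws)[i, j] - target[i, j]
    new_ws = []
    for k, w in enumerate(ws):
        # Factors applied before (right) and after (left) the k-th matrix.
        right = product(ws[:k]) if k > 0 else np.eye(d)
        left = product(ws[k + 1:]) if k < len(ws) - 1 else np.eye(d)
        # A[i, j] = left[i, :] @ W @ right[:, j], so the gradient is an outer product.
        grad_fit = resid * np.outer(left[i, :], right[:, j])
        # Ridge term lam * ||W||^2 contributes 2 * lam * W.
        new_ws.append(w - lr * (grad_fit + 2.0 * lam * w))
    return new_ws


def numerical_rank(a, tol=1e-2):
    """Count singular values above tol (the threshold is an arbitrary choice)."""
    return int(np.sum(np.linalg.svd(a, compute_uv=False) > tol))


def run_demo(steps=2000, d=6, depth=3, lam=1e-3, lr=1e-2, seed=0):
    """Toy matrix completion run; returns (regularized loss, numerical rank)."""
    rng = np.random.default_rng(seed)
    target = np.outer(rng.standard_normal(d), rng.standard_normal(d))
    target /= np.abs(target).max()          # rank-1 target with entries in [-1, 1]
    mask = rng.random((d, d)) < 0.7         # which entries are observed
    mask[0, 0] = True                       # guarantee at least one observation
    observed = np.argwhere(mask)
    # Near-identity start, so the initial end-to-end matrix has full rank.
    ws = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(depth)]
    for _ in range(steps):
        i, j = observed[rng.integers(len(observed))]
        ws = sgd_step(ws, target, i, j, lam, lr)
    return regularized_loss(ws, target, mask, lam), numerical_rank(product(ws))


if __name__ == "__main__":
    loss, rank = run_demo()
    print(f"regularized loss: {loss:.4f}, numerical rank: {rank}")
```

Calling `run_demo` with different values of `lam`, `lr`, and `seed` and logging `numerical_rank` over the course of training is one way to watch how the rank of the learned matrix evolves. Whether a rank drop shows up in a short run depends on these choices, so the sketch makes no claim to reproduce the paper's results.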
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Sparse and Compressive Sensing Techniques · Face and Expression Recognition
Methods: Stochastic Gradient Descent
