On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares

Stefan Steinerberger

arXiv:2007.13288 · math.NA · September 3, 2020

TL;DR

This paper analyzes how stochastic gradient descent (SGD) applied to least squares problems exhibits a regularization effect: when the error $x_k - x$ is dominated by singular vectors belonging to large singular values, SGD reduces the residual quickly, which smooths the iterates.

Contribution

The paper provides an explicit inequality that quantifies the regularization effect of SGD on least squares problems with invertible $A$, together with an extension, for symmetric matrices, to higher-order Sobolev spaces.

Findings

01

SGD induces a regularization effect depending on the residual's singular vector composition.

02

The inequality reveals a smoothing energy cascade from large to small singular values.

03

Extensions to symmetric matrices demonstrate higher-order Sobolev space regularization.
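
The dependence on the singular vector composition can be made concrete by expanding the error in the right singular vectors of $A$. This is a sketch based on the standard singular value decomposition; the symbols $\sigma_j$, $v_j$ and the coefficients $c_j$ are introduced here for illustration and are not notation taken from the paper.

$$x_k - x = \sum_{j} c_j v_j \quad \Longrightarrow \quad \|A (x_k - x)\|_2^2 = \sum_{j} \sigma_j^2 c_j^2, \qquad \|A^T A (x_k - x)\|_2^2 = \sum_{j} \sigma_j^4 c_j^2.$$

When $Ax = b$, the first sum is the residual $\|A x_k - b\|_2^2$. The second sum weights each component by $\sigma_j^4$ instead of $\sigma_j^2$, so the negative term in the inequality stated in the abstract is largest, relative to the residual, when the error is concentrated on large singular values.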

Abstract

We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \to \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_A$ depending (mildly) on $A$ such that $\mathbb{E} \|A x_{k+1} - b\|_2^2 \leq \left(1 + \frac{c_A}{\|A\|_F^2}\right) \|A x_k - b\|_2^2 - \frac{2}{\|A\|_F^2} \|A^T A (x_k - x)\|_2^2.$ This is a curious inequality: the last term has one more matrix applied to the residual $x_k - x$ than the remaining terms, so if $x_k - x$ is mainly composed of singular vectors with large singular values, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values has a smoothing effect.
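
The inequality can be checked numerically on a small example. The sketch below is illustrative only and rests on assumptions that the abstract does not spell out: it takes SGD to mean the randomized Kaczmarz update (a gradient step on a single equation, with the step size that projects onto that equation's hyperplane), draws row $a_i$ with probability $\|a_i\|_2^2 / \|A\|_F^2$, and uses $c_A = \max_i \|A a_i\|_2^2 / \|a_i\|_2^2$ as one constant for which the bound can be verified. The function names are hypothetical. The expectation is computed exactly by summing over all rows rather than by sampling.

```python
import random


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * t for m, t in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(t * t for t in v)


def kaczmarz_step(A, b, x, i):
    """Assumed SGD update: project x onto the hyperplane <a_i, x> = b_i."""
    a = A[i]
    r = (b[i] - sum(p * q for p, q in zip(a, x))) / sq_norm(a)
    return [xj + r * aj for xj, aj in zip(x, a)]


def check_inequality(A, b, x_true, x_k):
    """Return (lhs, rhs) of the inequality for one step starting at x_k."""
    frob = sum(sq_norm(row) for row in A)  # ||A||_F^2

    # Exact expectation over i, assuming P(i) = ||a_i||^2 / ||A||_F^2.
    lhs = 0.0
    for i, row in enumerate(A):
        x_next = kaczmarz_step(A, b, x_k, i)
        res_next = [p - q for p, q in zip(matvec(A, x_next), b)]
        lhs += sq_norm(row) / frob * sq_norm(res_next)

    # Assumed constant: c_A = max_i ||A a_i||^2 / ||a_i||^2.
    c_A = max(sq_norm(matvec(A, row)) / sq_norm(row) for row in A)

    res_k = [p - q for p, q in zip(matvec(A, x_k), b)]
    err = [p - q for p, q in zip(x_k, x_true)]
    ata_err = matvec(transpose(A), matvec(A, err))  # A^T A (x_k - x)
    rhs = (1 + c_A / frob) * sq_norm(res_k) - 2 / frob * sq_norm(ata_err)
    return lhs, rhs


random.seed(0)
n = 5
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
x_true = [random.gauss(0, 1) for _ in range(n)]
b = matvec(A, x_true)  # consistent system, so x_true solves Ax = b
x_k = [random.gauss(0, 1) for _ in range(n)]

lhs, rhs = check_inequality(A, b, x_true, x_k)
print(lhs, rhs, lhs <= rhs + 1e-9)
```

Plain Python lists keep the sketch dependency-free; for larger matrices a numerical library would be the natural choice.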

Taxonomy

Methods: Stochastic Gradient Descent